Abstract

We propose a localized approach to multiple kernel learning that can be formulated as a convex
optimization problem over a given cluster structure. For this problem, we obtain generalization
error guarantees and derive an optimization algorithm based on the Fenchel dual representation.
Experiments on real-world datasets from the application domains of computational biology and
computer vision show that convex localized multiple kernel learning can achieve higher prediction
accuracies than its global and non-convex local counterparts.
1 Introduction

Kernel-based methods such as support vector machines have found diverse applications due to their
distinct merits, such as decent computational complexity, high usability, and a solid mathematical
foundation [e.g., ?]. The performance of such algorithms, however, crucially depends on the involved
kernel function, as it intrinsically specifies the feature space where the learning process is
implemented, and thus provides a similarity measure on the input space. Yet in the standard setting
of these methods the choice of the involved kernel is typically left to the user.

A substantial step toward the complete automatization of kernel-based machine learning is achieved
in Lanckriet et al. [32], who introduce the multiple kernel learning (MKL) framework. MKL offers a
principled way of encoding complementary information with distinct base kernels and automatically
learning an optimal combination of those [?]. MKL can be phrased as a single convex optimization
problem, which facilitates the application of efficient numerical optimization strategies [?] and
theoretical understanding of the generalization performance of the resulting models
[10, 11, ?, 25, ?, ?, 48, 56]. While early sparsity-inducing approaches failed to live up to
expectations in terms of improvement over uniform combinations of kernels [cf. 9 and references
therein], it was shown that improved predictive accuracy can be achieved by employing appropriate
regularization [?, 31].

Currently, most of the existing algorithms fall into the global setting of MKL, in the sense that all input 
instances share the same kernel weights. However, this ignores the fact that instances may require sample- 
adaptive kernel weights. 

For instance, consider the two images of horses shown to the right. Multiple kernels can be
defined, capturing the shapes in the image and the color distribution over various channels. In the
image to the left, the depicted horse and the image background exhibit distinctly different color
distributions, while for the image to the right the contrary is the case. Hence, a color kernel is
more significant for detecting a horse in the image to the left than in the image to the right.
This example motivates studying localized approaches to MKL [14, 19, 37, 42, 55].

Existing approaches to localized MKL (reviewed in Section 1.1) optimize non-convex objective
functions. This puts their generalization ability into doubt. Indeed, besides the recent work by
Lei et al. [34], the generalization performance of localized MKL algorithms (as measured through
large-deviation bounds) is poorly understood, which potentially could make these algorithms prone
to overfitting. Further potential disadvantages of non-convex localized MKL approaches include the
computational difficulty of finding good local minima and the induced lack of reproducibility of
results (due to varying local optima).

This paper presents a convex formulation of localized multiple kernel learning, stated as a single
convex optimization problem over a precomputed cluster structure, which can be obtained through
either a convex or a non-convex clustering method. We derive an efficient optimization algorithm
based on Fenchel duality. Using Rademacher complexity theory, we establish large-deviation
inequalities for localized MKL, showing that the smoothness in the cluster membership assignments
crucially controls the generalization error. Computational experiments on data from the domains of
computational biology and computer vision show that the proposed convex approach can achieve higher
prediction accuracies than its global and non-convex local counterparts (improvements of up to 4-5%
for splice site detection).

1.1 Related Work 

Gönen and Alpaydin [14] initiate the work on localized MKL by introducing gating models

\[ f(x) = \sum_{m=1}^{M} \eta_m(x; v) \langle w_m, \phi_m(x) \rangle + b, \qquad \eta_m(x; v) \propto \exp(\langle v_m, x \rangle + v_{m0}) \]

to achieve local assignments of kernel weights, resulting in a non-convex MKL problem. To avoid
fitting individual samples too closely, Yang et al. [55] give a group-sensitive formulation of
localized MKL, where kernel weights vary at the group level instead of the example level. Mu and
Zhou [42] also introduce a non-uniform MKL allowing the kernel weights to vary at the cluster level
and tune the kernel weights under the graph embedding framework. Han and Liu [19] build on Gönen
and Alpaydin [14] by complementing the spatial-similarity-based kernels with probability confidence
kernels reflecting the likelihood of examples belonging to the same class. Li et al. propose a
multiple kernel clustering method by maximizing local kernel alignments. Liu et al. [37] present
sample-adaptive approaches to localized MKL, where kernels can be switched on/off at the example
level by introducing a latent binary vector for each individual sample; these vectors and the
kernel weights are then jointly optimized via the margin maximization principle. Moeller et al. [?]
present a unified viewpoint of localized MKL by interpreting gating functions in terms of local
reproducing kernel Hilbert spaces acting on the data. All the aforementioned approaches to
localized MKL are formulated in terms of non-convex optimization problems, and deep theoretical
foundations in the form of generalization error or excess risk bounds are unknown. Although Cortes
et al. [?] present a convex approach to MKL based on controlling the local Rademacher complexity,
the meaning of locality is different in Cortes et al. [?]: it refers to the localization of the
hypothesis class, which can result in sharper excess risk bounds [25, 26], and is not related to
localized multiple kernel learning. Liu et al. [?] extend the idea of sample-adaptive MKL to
address the issue of missing kernel information on some examples. More recently, Lei et al. [34]
propose an MKL method that decouples the locality structure learning, done with a hard clustering
strategy, from the optimization of the parameters in the spirit of multi-task learning. They also
develop the first generalization error bounds for localized MKL.

2 Convex Localized Multiple Kernel Learning 

2.1 Problem setting and notation 

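Suppose that we are given $n$ training samples $(x_1, y_1), \ldots, (x_n, y_n)$ that are partitioned
into $l$ disjoint clusters $S_1, \ldots, S_l$ in a probabilistic manner, meaning that, for each
cluster $S_j$, we have a function $c_j : \mathcal{X} \to [0,1]$ indicating the likelihood of $x$
falling into cluster $j$, i.e., $\sum_{j \in \mathbb{N}_l} c_j(x) = 1$ for all $x \in \mathcal{X}$.
Here, for any $d \in \mathbb{N}$, we introduce the notation $\mathbb{N}_d = \{1, \ldots, d\}$.
Suppose that we are given $M$ base kernels $k_1, \ldots, k_M$ with
$k_m(x, \tilde{x}) = \langle \phi_m(x), \phi_m(\tilde{x}) \rangle_{k_m}$, corresponding to linear
models $f_j(x) = \langle w_j, \phi(x) \rangle + b = \sum_{m \in \mathbb{N}_M} \langle w_j^{(m)}, \phi_m(x) \rangle + b$,
where $w_j = (w_j^{(1)}, \ldots, w_j^{(M)})$ and $\phi = (\phi_1, \ldots, \phi_M)$. Then we consider
the following proposed model, which is a weighted combination of these $l$ local models:

\[ f(x) = \sum_{j \in \mathbb{N}_l} c_j(x) f_j(x) = \sum_{j \in \mathbb{N}_l} c_j(x) \Big[ \sum_{m \in \mathbb{N}_M} \langle w_j^{(m)}, \phi_m(x) \rangle \Big] + b. \tag{1} \]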
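As a minimal sketch of the model above, the following snippet evaluates the weighted combination of local linear models for explicit, finite-dimensional feature maps. The function name `localized_predict`, the argument layout (`weights[j][m]` holding $w_j^{(m)}$) and the toy feature maps are illustrative assumptions, not part of the paper.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Euclidean inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def localized_predict(
    x: float,
    likelihoods: List[Callable[[float], float]],    # c_j, one per cluster
    feature_maps: List[Callable[[float], Vector]],  # phi_m, one per kernel
    weights: List[List[Vector]],                    # weights[j][m] = w_j^(m)
    bias: float,
) -> float:
    """Evaluate f(x) = sum_j c_j(x) * sum_m <w_j^(m), phi_m(x)> + b."""
    feats = [phi(x) for phi in feature_maps]
    value = bias
    for c_j, w_j in zip(likelihoods, weights):
        # Local model of cluster j (without the shared bias).
        local = sum(dot(w_jm, f_m) for w_jm, f_m in zip(w_j, feats))
        value += c_j(x) * local
    return value


# Toy setup (hypothetical): two clusters, two explicit feature maps.
likelihoods = [lambda x: 0.25, lambda x: 0.75]   # likelihoods sum to 1 for every x
feature_maps = [lambda x: [x], lambda x: [x * x]]
weights = [[[1.0], [0.0]], [[0.0], [2.0]]]
f_at_2 = localized_predict(2.0, likelihoods, feature_maps, weights, bias=0.5)
```

Because the likelihoods sum to one, the bias enters exactly once, which is why it is added outside the loop over clusters.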


2.2 Proposed convex localized MKL method 

Using the above notation, the proposed convex localized MKL model can be formulated as follows. 

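Problem 1 (Convex Localized Multiple Kernel Learning (CLMKL), Primal). Let $C > 0$ and $p \ge 1$.
Given a loss function $\ell(t, y) : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}$ convex w.r.t. the
first argument and cluster likelihood functions $c_j : \mathcal{X} \to [0,1]$, $j \in \mathbb{N}_l$,
solve

\[
\begin{aligned}
\inf_{w, t, \beta, b} \quad & \sum_{j \in \mathbb{N}_l} \sum_{m \in \mathbb{N}_M} \frac{\|w_j^{(m)}\|_2^2}{2\beta_{jm}} + C \sum_{i \in \mathbb{N}_n} \ell(t_i, y_i) \\
\text{s.t.} \quad & \beta_{jm} \ge 0, \quad \sum_{m \in \mathbb{N}_M} \beta_{jm}^p \le 1 \quad \forall j \in \mathbb{N}_l,\ m \in \mathbb{N}_M, \\
& \sum_{j \in \mathbb{N}_l} c_j(x_i) \Big[ \sum_{m \in \mathbb{N}_M} \langle w_j^{(m)}, \phi_m(x_i) \rangle \Big] + b = t_i, \quad \forall i \in \mathbb{N}_n.
\end{aligned}
\tag{P}
\]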


The core idea of the above problem is to use cluster likelihood functions for each example and a
separate $\ell_p$-norm constraint on the kernel weights $\beta_j := (\beta_{j1}, \ldots, \beta_{jM})$
for each cluster $j$ [31]. Thus each instance can obtain separate kernel weights. The above problem
is convex, since a quadratic over a linear function is convex [e.g., 6, p. 89]. Note that Slater's
condition can be directly checked, and thus strong duality holds.
2.3 Dualization

In this section we derive a dual representation of Problem 1. We consider two levels of duality: a
partially dualized problem, with fixed kernel weights $\beta_{jm}$, and the entirely dualized
problem with respect to all occurring primal variables. From the former we derive an efficient
two-step optimization scheme (Section 3). The latter allows us to compute the duality gap and thus
to obtain a sound stopping condition for the proposed algorithm. We focus on the entirely dualized
problem here. The partial dualization is deferred to Supplemental Material C.


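Dual CLMKL Optimization Problem. For $w_j = (w_j^{(1)}, \ldots, w_j^{(M)})$, we define the
$\ell_{2,p}$-norm by
$\|w_j\|_{2,p} := \big\| (\|w_j^{(m)}\|_{k_m})_{m=1}^{M} \big\|_p = \big( \sum_{m \in \mathbb{N}_M} \|w_j^{(m)}\|_{k_m}^p \big)^{1/p}$.
For a function $h$, we denote by $h^*(x) = \sup_{\mu} [x^\top \mu - h(\mu)]$ its Fenchel-Legendre
conjugate. This results in the following dual.

Problem 2 (CLMKL, Dual). The dual problem of (P) is given by

\[ \sup_{\sum_{i \in \mathbb{N}_n} \alpha_i = 0} \Big\{ -C \sum_{i \in \mathbb{N}_n} \ell^*\Big(-\frac{\alpha_i}{C}, y_i\Big) - \frac{1}{2} \sum_{j \in \mathbb{N}_l} \Big\| \Big( \sum_{i \in \mathbb{N}_n} \alpha_i c_j(x_i) \phi_m(x_i) \Big)_{m=1}^{M} \Big\|_{2, \frac{2p}{p-1}}^2 \Big\}. \tag{D} \]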


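Dualization. Using Lemma A.2 from Supplemental Material A.1 to express the optimal $\beta_{jm}$ in
terms of $w_j^{(m)}$, the problem (P) is equivalent to

\[
\begin{aligned}
\inf_{w, t, b} \quad & \frac{1}{2} \sum_{j \in \mathbb{N}_l} \Big( \sum_{m \in \mathbb{N}_M} \|w_j^{(m)}\|_2^{\frac{2p}{p+1}} \Big)^{\frac{p+1}{p}} + C \sum_{i \in \mathbb{N}_n} \ell(t_i, y_i) \\
\text{s.t.} \quad & \sum_{j \in \mathbb{N}_l} \Big[ c_j(x_i) \sum_{m \in \mathbb{N}_M} \langle w_j^{(m)}, \phi_m(x_i) \rangle \Big] + b = t_i, \quad \forall i \in \mathbb{N}_n.
\end{aligned}
\tag{2}
\]

Introducing Lagrangian multipliers $\alpha_i$, $i \in \mathbb{N}_n$, the Lagrangian saddle problem
of Eq. (2) is

\[
\begin{aligned}
& \sup_{\alpha} \inf_{w, t, b} \ \frac{1}{2} \sum_{j \in \mathbb{N}_l} \Big( \sum_{m \in \mathbb{N}_M} \|w_j^{(m)}\|_2^{\frac{2p}{p+1}} \Big)^{\frac{p+1}{p}} + C \sum_{i \in \mathbb{N}_n} \ell(t_i, y_i) - \sum_{i \in \mathbb{N}_n} \alpha_i \Big( \sum_{j \in \mathbb{N}_l} c_j(x_i) \sum_{m \in \mathbb{N}_M} \langle w_j^{(m)}, \phi_m(x_i) \rangle + b - t_i \Big) \\
={} & \sup_{\alpha} \Big\{ -C \sum_{i \in \mathbb{N}_n} \sup_{t_i} \Big[ -\ell(t_i, y_i) - \frac{1}{C} \alpha_i t_i \Big] - \sup_{b} \sum_{i \in \mathbb{N}_n} \alpha_i b \\
& \qquad - \sup_{w} \Big[ \sum_{j \in \mathbb{N}_l} \sum_{m \in \mathbb{N}_M} \Big\langle w_j^{(m)}, \sum_{i \in \mathbb{N}_n} \alpha_i c_j(x_i) \phi_m(x_i) \Big\rangle - \frac{1}{2} \sum_{j \in \mathbb{N}_l} \Big( \sum_{m \in \mathbb{N}_M} \|w_j^{(m)}\|_2^{\frac{2p}{p+1}} \Big)^{\frac{p+1}{p}} \Big] \Big\} \\
={} & \sup_{\sum_{i} \alpha_i = 0} \Big\{ -C \sum_{i \in \mathbb{N}_n} \ell^*\Big(-\frac{\alpha_i}{C}, y_i\Big) - \sum_{j \in \mathbb{N}_l} \Big[ \frac{1}{2} \|\cdot\|_{2, \frac{2p}{p+1}}^2 \Big]^* \Big( \Big( \sum_{i \in \mathbb{N}_n} \alpha_i c_j(x_i) \phi_m(x_i) \Big)_{m=1}^{M} \Big) \Big\}.
\end{aligned}
\tag{3}
\]

The result (D) now follows by recalling that for a norm $\|\cdot\|$, its dual norm $\|\cdot\|_*$ is
defined by $\|x\|_* = \sup_{\|\mu\| = 1} \langle x, \mu \rangle$ and satisfies
$(\frac{1}{2}\|\cdot\|^2)^* = \frac{1}{2}\|\cdot\|_*^2$ [6]. Furthermore, it is straightforward to
show that $\|\cdot\|_{2, \frac{2p}{p-1}}$ is the dual norm of $\|\cdot\|_{2, \frac{2p}{p+1}}$. □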


2.4 Representer Theorem 

We can use the above derivation to obtain a lower bound on the optimal value of the primal
optimization problem, from which we can compute the duality gap using the theorem below. The proof
is given in Supplemental Material A.2.

Theorem 3 (Representer Theorem). For any dual variable $(\alpha_i)_{i=1}^{n}$ in (D), the optimal
primal variable $\{w_j^{(m)}(\alpha)\}_{j,m=1}^{l,M}$ in the Lagrangian saddle problem (3) can be
represented as

\[ w_j^{(m)}(\alpha) = \Big[ \sum_{\tilde{m} \in \mathbb{N}_M} \Big\| \sum_{i \in \mathbb{N}_n} \alpha_i c_j(x_i) \phi_{\tilde{m}}(x_i) \Big\|_2^{\frac{2p}{p-1}} \Big]^{-\frac{1}{p}} \Big\| \sum_{i \in \mathbb{N}_n} \alpha_i c_j(x_i) \phi_m(x_i) \Big\|_2^{\frac{2}{p-1}} \Big[ \sum_{i \in \mathbb{N}_n} \alpha_i c_j(x_i) \phi_m(x_i) \Big]. \]


2.5 Support-Vector Classification 

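For the hinge loss, the Fenchel-Legendre conjugate becomes $\ell^*(t, y) = t/y$ (a function of $t$)
if $-1 \le t/y \le 0$ and $\infty$ elsewise. Hence, for each $i$, the term
$\ell^*(-\frac{\alpha_i}{C}, y_i)$ translates to $-\frac{\alpha_i}{C y_i}$, provided that
$0 \le \frac{\alpha_i}{y_i} \le C$. With a variable substitution of the form
$\alpha_i^{\text{new}} = \frac{\alpha_i}{y_i}$, the complete dual problem reduces as follows.

Problem 4 (CLMKL, SVM Formulation). For the hinge loss, the dual CLMKL problem is given by:

\[ \sup_{\alpha:\ 0 \le \alpha \le C,\ \sum_{i \in \mathbb{N}_n} \alpha_i y_i = 0} \ -\frac{1}{2} \sum_{j \in \mathbb{N}_l} \Big\| \Big( \sum_{i \in \mathbb{N}_n} \alpha_i y_i c_j(x_i) \phi_m(x_i) \Big)_{m=1}^{M} \Big\|_{2, \frac{2p}{p-1}}^2 + \sum_{i \in \mathbb{N}_n} \alpha_i. \tag{4} \]

A corresponding formulation for support-vector regression is given in Supplemental Material B.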
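The hinge-loss conjugate used in this section can be checked numerically. The sketch below compares the closed form $\ell^*(s, y) = s/y$ on $-1 \le s/y \le 0$ with a brute-force supremum of $s t - \ell(t, y)$ over a finite grid. The helper names, the grid radius and the grid resolution are assumptions chosen for illustration; the grid supremum is only a finite approximation, so outside the admissible interval it grows with the radius instead of being infinite.

```python
import math


def hinge(t: float, y: float) -> float:
    """Hinge loss l(t, y) = max(0, 1 - t * y) for labels y in {-1, +1}."""
    return max(0.0, 1.0 - t * y)


def hinge_conjugate(s: float, y: float) -> float:
    """Closed form of the conjugate in the first argument."""
    r = s / y
    return r if -1.0 <= r <= 0.0 else math.inf


def conjugate_on_grid(s: float, y: float, radius: float = 50.0, steps: int = 20001) -> float:
    """Approximate sup_t [s * t - hinge(t, y)] on an equispaced grid in [-radius, radius]."""
    best = -math.inf
    for k in range(steps):
        t = -radius + 2.0 * radius * k / (steps - 1)
        best = max(best, s * t - hinge(t, y))
    return best
```

For example, `conjugate_on_grid(-0.3, 1.0)` and `hinge_conjugate(-0.3, 1.0)` agree, since the supremum is attained at $t = 1/y$, which lies on the grid.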


3 Optimization Algorithms 

As pioneered in Sonnenburg et al. [?], we consider here a two-layer optimization procedure to solve
the problem (P), where the variables are divided into two groups: the group of kernel weights
$\{\beta_{jm}\}_{j,m=1}^{l,M}$ and the group of weight vectors $\{w_j^{(m)}\}_{j,m=1}^{l,M}$. In each
iteration, we alternatingly optimize one group of variables while fixing the other group of
variables. These iterations are repeated until some optimality conditions are satisfied. To this
aim, we need to find efficient strategies to solve the two subproblems.

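It is not difficult to show (cf. Supplemental Material C) that, given fixed kernel weights
$\beta = (\beta_{jm})$, the CLMKL dual problem is given by

\[ \sup_{\alpha:\ \sum_{i \in \mathbb{N}_n} \alpha_i = 0} \ -\frac{1}{2} \sum_{j \in \mathbb{N}_l} \sum_{m \in \mathbb{N}_M} \beta_{jm} \Big\| \sum_{i \in \mathbb{N}_n} \alpha_i c_j(x_i) \phi_m(x_i) \Big\|_2^2 - C \sum_{i \in \mathbb{N}_n} \ell^*\Big(-\frac{\alpha_i}{C}, y_i\Big), \tag{5} \]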

which is a standard SVM problem using the kernel

\[ \tilde{k}(x_i, x_{\tilde{i}}) := \sum_{m \in \mathbb{N}_M} \sum_{j \in \mathbb{N}_l} \beta_{jm} c_j(x_i) c_j(x_{\tilde{i}}) k_m(x_i, x_{\tilde{i}}). \tag{6} \]

This allows us to employ very efficient existing SVM solvers [?]. In the degenerate case with
$c_j(x) \in \{0, 1\}$, the kernel $\tilde{k}$ would be supported over those sample pairs belonging
to the same cluster.
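The composite kernel can be assembled directly from precomputed base Gram matrices. The sketch below assumes the form $\tilde{k}(x_i, x_k) = \sum_m \sum_j \beta_{jm} c_j(x_i) c_j(x_k) k_m(x_i, x_k)$; the function name `composite_kernel` and the nested-list data layout are illustrative assumptions, and a practical implementation would use a matrix library instead of plain loops.

```python
from typing import List

Matrix = List[List[float]]


def composite_kernel(beta: Matrix, C: Matrix, base_kernels: List[Matrix]) -> Matrix:
    """Return K[i][k] = sum_m sum_j beta[j][m] * c_j(x_i) * c_j(x_k) * k_m(x_i, x_k).

    beta[j][m]       -- kernel weight of cluster j and base kernel m
    C[j][i]          -- likelihood c_j(x_i) of sample i under cluster j
    base_kernels[m]  -- n x n Gram matrix of base kernel m
    """
    n = len(base_kernels[0])
    K = [[0.0] * n for _ in range(n)]
    for j, beta_j in enumerate(beta):
        c_j = C[j]
        for m, K_m in enumerate(base_kernels):
            b = beta_j[m]
            if b == 0.0:
                continue  # switched-off kernels contribute nothing
            for i in range(n):
                for k in range(n):
                    K[i][k] += b * c_j[i] * c_j[k] * K_m[i][k]
    return K


# Toy check with hard assignments: samples 0 and 1 in cluster 0, sample 2 in cluster 1.
ones = [[1.0] * 3 for _ in range(3)]
K_hard = composite_kernel(
    beta=[[0.5], [2.0]],
    C=[[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    base_kernels=[ones],
)
```

In the toy check, entries between samples of different clusters vanish, which illustrates the degenerate hard-assignment case described above. The loops cost O(n^2 M l) operations, in line with the kernel computation cost quoted in the runtime analysis.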

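Next, we show that the subproblem of optimizing the kernel weights for fixed $w_j^{(m)}$ and $b$ has
a closed-form solution.

Proposition 5 (Solution of the Subproblem w.r.t. the Kernel Weights). Given fixed $w_j^{(m)}$ and
$b$, the minimal $\beta_{jm}$ in optimization problem (P) is attained for

\[ \beta_{jm} = \|w_j^{(m)}\|_2^{\frac{2}{p+1}} \Big( \sum_{k \in \mathbb{N}_M} \|w_j^{(k)}\|_2^{\frac{2p}{p+1}} \Big)^{-\frac{1}{p}}. \tag{7} \]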
We defer the detailed proof to Supplemental Material A.3 due to lack of space. To apply Proposition 5
for updating $\beta_{jm}$, we need to compute the norm of $w_j^{(m)}$, and this can be accomplished
by the following representation of $w_j^{(m)}$ given fixed $\beta_{jm}$ (cf. the Supplemental
Material):

\[ w_j^{(m)} = \beta_{jm} \sum_{i \in \mathbb{N}_n} \alpha_i c_j(x_i) \phi_m(x_i). \tag{8} \]

The prediction function is then derived by plugging the above representation into Eq. (1).
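The kernel-weight step can be sketched as follows. The code assumes the closed-form update $\beta_{jm} = \|w_j^{(m)}\|_2^{2/(p+1)} (\sum_k \|w_j^{(k)}\|_2^{2p/(p+1)})^{-1/p}$ and the representation $w_j^{(m)} = \beta_{jm} \sum_i a_i c_j(x_i) \phi_m(x_i)$, where `a` holds the dual coefficients (for the SVM formulation, the products $\alpha_i y_i$). The function names, the data layout and the uniform fallback for an all-zero cluster are assumptions made for this illustration.

```python
import math
from typing import List

Matrix = List[List[float]]


def weight_norms_sq(beta: Matrix, C: Matrix, base_kernels: List[Matrix],
                    a: List[float]) -> Matrix:
    """||w_j^(m)||^2 = beta_jm^2 * sum_{i,k} a_i a_k c_j(x_i) c_j(x_k) k_m(x_i, x_k)."""
    n = len(a)
    out = []
    for j, beta_j in enumerate(beta):
        v = [a[i] * C[j][i] for i in range(n)]  # coefficients weighted by cluster j
        row = []
        for m, K_m in enumerate(base_kernels):
            quad = sum(v[i] * v[k] * K_m[i][k] for i in range(n) for k in range(n))
            row.append(beta_j[m] ** 2 * quad)
        out.append(row)
    return out


def update_beta(norms_sq: Matrix, p: float) -> Matrix:
    """Closed-form kernel-weight update, applied to each cluster separately."""
    new_beta = []
    for row in norms_sq:
        norms = [math.sqrt(max(v, 0.0)) for v in row]
        denom = sum(v ** (2.0 * p / (p + 1.0)) for v in norms) ** (1.0 / p)
        if denom == 0.0:
            # Assumption: fall back to uniform weights on the l_p unit sphere.
            new_beta.append([(1.0 / len(row)) ** (1.0 / p)] * len(row))
        else:
            new_beta.append([v ** (2.0 / (p + 1.0)) / denom for v in norms])
    return new_beta
```

By construction, the updated weights of every cluster satisfy $\sum_m \beta_{jm}^p = 1$, so the norm constraint of the primal problem is active after each update.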

The resulting optimization algorithm for CLMKL is shown in Algorithm 1. The algorithm alternates
between solving an SVM subproblem for fixed kernel weights (Line 4) and updating the kernel weights
in a closed-form manner (Line 6). To improve the efficiency, we start with a crude precision and
gradually improve the precision of solving the SVM subproblem. The proposed optimization approach
can potentially be extended to an interleaved algorithm where the optimization of the MKL step is
directly integrated into the SVM solver. Such a strategy can increase the computational efficiency
by up to 1-2 orders of magnitude (cf. [45]; Figure 7 in Kloft et al. [31]). The requirement to
compute the kernel $\tilde{k}$ at each iteration can be further relaxed by updating only some
randomly selected kernel elements.


Algorithm 1: Training algorithm for convex localized multiple kernel learning (CLMKL).
input: examples $\{(x_i, y_i)\}_{i=1}^{n} \subset (\mathcal{X} \times \{-1, 1\})^n$ together with the
       likelihood functions $\{c_j(x)\}_{j=1}^{l}$, $M$ base kernels $k_1, \ldots, k_M$.
1 initialize $\beta_{jm} = \sqrt[p]{1/M}$, $w_j^{(m)} = 0$ for all $j \in \mathbb{N}_l$, $m \in \mathbb{N}_M$
2 while optimality conditions are not satisfied do
3     calculate the kernel matrix $\tilde{k}$ by Eq. (6)
4     compute $\alpha$ by solving the canonical SVM with $\tilde{k}$
5     compute $\|w_j^{(m)}\|_2^2$ for all $j, m$, with $w_j^{(m)}$ given by Eq. (8)
6     update $\beta_{jm}$ for all $j, m$ according to Eq. (7)
7 end


An alternative strategy would be to optimize the problem directly (without the need of a two-step
wrapper approach). Such an approach has been presented in Sun et al. [49] in the context of
$\ell_p$-norm MKL.


3.1 Convergence Analysis of the Algorithm 


The theorem below, which is proved in Supplemental Material A.4, shows convergence of Algorithm 1.
The core idea is to view Algorithm 1 as an example of the classical block coordinate descent (BCD)
method, convergence of which is well understood.

Theorem 6 (Convergence analysis of Algorithm 1). Assume that

(B1) the feature map $\phi_m(x)$ is of finite dimension, i.e., $\phi_m(x) \in \mathbb{R}^{e_m}$, $e_m < \infty$, $\forall m \in \mathbb{N}_M$;
(B2) the loss function $\ell$ is convex, continuous w.r.t. the first argument, and $\ell(0, y) < \infty$, $\forall y \in \mathcal{Y}$;
(B3) any iterate $\beta_{jm}$ traversed by Algorithm 1 has $\beta_{jm} > 0$;
(B4) the SVM computation in line 4 of Algorithm 1 is solved exactly in each iteration.

Then, any limit point of the sequence traversed by Algorithm 1 minimizes the problem (P).
3.2 Runtime Complexity Analysis

At each iteration of the training stage, we need $O(n^2 M l)$ operations to calculate the kernel (6),
$O(n^2 n_s)$ operations to solve a standard SVM problem, $O(M l n_s^2)$ operations to calculate the
norm according to the representation (8), and $O(M l)$ operations to update the kernel weights.
Thus, the computational cost at each iteration is $O(n^2 M l)$. The time complexity at the test
stage is $O(n_t n_s M l)$. Here, $n_s$ and $n_t$ are the number of support vectors and test points,
respectively.


4 Generalization Error Bounds 


In this section we present generalization error bounds for our approach. We give a purely
data-dependent bound on the generalization error, which is obtained using Rademacher complexity
theory [?]. To start with, our basic strategy is to plug the optimal $\beta_{jm}$, established in
Eq. (7), into (P), so as to equivalently rewrite (P) as a block-norm regularized problem as follows:

\[ \min_{w, b} \ \frac{1}{2} \sum_{j \in \mathbb{N}_l} \Big[ \sum_{m \in \mathbb{N}_M} \|w_j^{(m)}\|_2^{\frac{2p}{p+1}} \Big]^{\frac{p+1}{p}} + C \sum_{i \in \mathbb{N}_n} \ell\Big( \sum_{j \in \mathbb{N}_l} c_j(x_i) \Big[ \sum_{m \in \mathbb{N}_M} \langle w_j^{(m)}, \phi_m(x_i) \rangle \Big] + b,\ y_i \Big). \tag{9} \]

Solving (9) corresponds to empirical risk minimization in the following hypothesis space:

\[ H_{p,D} := H_{p,D,M} = \Big\{ f_w : x \mapsto \sum_{j \in \mathbb{N}_l} c_j(x) \Big[ \sum_{m \in \mathbb{N}_M} \langle w_j^{(m)}, \phi_m(x) \rangle \Big] \ :\ \sum_{j \in \mathbb{N}_l} \|w_j\|_{2, \frac{2p}{p+1}}^2 \le D \Big\}. \]

The following theorem establishes the Rademacher complexity bounds for the function class $H_{p,D}$,
from which we derive generalization error bounds for CLMKL in Theorem 9. The proofs of Theorems 8
and 9 are given in Supplemental Material A.5.

Definition 7. For a fixed sample $S = (x_1, \ldots, x_n)$, the empirical Rademacher complexity of a
hypothesis space $H$ is defined as

\[ R_n(H) := \mathbb{E}_{\sigma} \sup_{f \in H} \frac{1}{n} \sum_{i \in \mathbb{N}_n} \sigma_i f(x_i), \]

where the expectation is taken w.r.t. $\sigma = (\sigma_1, \ldots, \sigma_n)^\top$, with $\sigma_i$,
$i \in \mathbb{N}_n$, being a sequence of independent uniform $\{\pm 1\}$-valued random variables.


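Theorem 8 (CLMKL Rademacher complexity bounds). The empirical Rademacher complexity of $H_{p,D}$
can be controlled by

\[ R_n(H_{p,D}) \le \frac{\sqrt{D}}{n} \inf_{2 \le t \le \frac{2p}{p-1}} \Big( t \sum_{j \in \mathbb{N}_l} \Big\| \Big( \sum_{i \in \mathbb{N}_n} c_j^2(x_i) k_m(x_i, x_i) \Big)_{m=1}^{M} \Big\|_{\frac{t}{2}} \Big)^{1/2}. \tag{10} \]

If, additionally, $k_m(x, x) \le B$ for any $x \in \mathcal{X}$ and any $m \in \mathbb{N}_M$, then we
have

\[ R_n(H_{p,D}) \le \frac{\sqrt{DB}}{n} \inf_{2 \le t \le \frac{2p}{p-1}} \Big( t M^{\frac{2}{t}} \sum_{j \in \mathbb{N}_l} \sum_{i \in \mathbb{N}_n} c_j^2(x_i) \Big)^{1/2}. \]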


Theorem 9 (CLMKL Generalization Error Bounds). Assume that $k_m(x, x) \le B$, $\forall m \in \mathbb{N}_M$,
$x \in \mathcal{X}$. Suppose the loss function $\ell$ is $L$-Lipschitz and bounded by $B_\ell$. Then,
the following inequality holds with probability larger than $1 - \delta$ over samples of size $n$
for all classifiers $h \in H_{p,D}$:


1 






£,(/.) < £,..m + t £ £ 


ieNi ieN„ 


where $\mathcal{E}_\ell(h) := \mathbb{E}[\ell(h(x), y)]$ and $\mathcal{E}_{\ell,z}(h) := \frac{1}{n} \sum_{i \in \mathbb{N}_n} \ell(h(x_i), y_i)$.


The above bound enjoys a mild dependence on the number of kernels. One can show (cf. Supplemental
Material A.5) that the dependence is $O(\log M)$ for $p \le (\log M - 1)^{-1} \log M$ and
$O(M^{\frac{p-1}{2p}})$ otherwise. In particular, the dependence is logarithmic for $p = 1$
(sparsity-inducing CLMKL). These dependencies recover the best known results for global MKL
algorithms in Cortes et al. [10], Kloft and Blanchard [25], and Kloft et al. [31].
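To make the dependence on the number of kernels concrete, the sketch below minimizes the factor $(t M^{2/t})^{1/2}$ over $2 \le t \le 2p/(p-1)$, which is the only place where $M$ enters the bound under the assumption $k_m(x, x) \le B$. The function name `kernel_factor` is hypothetical. The unconstrained minimizer is $t = 2 \log M$, so the factor equals $(2e \log M)^{1/2}$ whenever that value is admissible, i.e. when $p \le \log M / (\log M - 1)$; otherwise the upper end of the interval is optimal and the factor scales as $M^{(p-1)/(2p)}$.

```python
import math
from typing import Tuple


def kernel_factor(M: int, p: float) -> Tuple[float, float]:
    """Minimize sqrt(t * M**(2/t)) over 2 <= t <= 2p/(p-1); return (t_star, value).

    log(t) + (2/t) * log(M) decreases for t < 2*log(M) and increases afterwards,
    so clipping the unconstrained minimizer to the interval is optimal.
    """
    if M < 1 or p < 1.0:
        raise ValueError("need M >= 1 and p >= 1")
    t_max = math.inf if p == 1.0 else 2.0 * p / (p - 1.0)
    t_star = min(max(2.0 * math.log(M), 2.0), t_max)
    return t_star, math.sqrt(t_star * M ** (2.0 / t_star))
```

For instance, with M = 100 kernels the factor is about 5.0 for p = 1 and sqrt(40), about 6.3, for p = 2.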

The bounds above exhibit a strong dependence on the likelihood functions, which inspires us to
derive a new algorithmic strategy as follows. Consider the special case where $c_j(x)$ takes values
in $\{0, 1\}$ (hard cluster membership assignment), and thus the term determining the bound has
$\sum_{j \in \mathbb{N}_l} \sum_{i \in \mathbb{N}_n} c_j^2(x_i) = n$. On the other hand, if
$c_j(x) \equiv \frac{1}{l}$, $j \in \mathbb{N}_l$ (uniform cluster membership assignment), we have
the favorable term $\sum_{j \in \mathbb{N}_l} \sum_{i \in \mathbb{N}_n} c_j^2(x_i) = \frac{n}{l}$.
This motivates us to introduce a parameter $\tau$ controlling the complexity of the bound by
considering likelihood functions of the form

\[ c_j(x) \propto \exp(-\tau\, \mathrm{dist}^2(x, S_j)), \tag{11} \]

where $\mathrm{dist}(x, S_j)$ is the distance between the example $x$ and the cluster $S_j$. By
letting $\tau = 0$ and $\tau = \infty$, we recover uniform and hard cluster assignments,
respectively. Intermediate values of $\tau$ correspond to more balanced cluster assignments. As the
bounds above illustrate, by tuning $\tau$ we can adjust the resulting models' complexities.
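The effect of $\tau$ on the complexity term can be illustrated numerically. The sketch below computes normalized likelihoods $c_j(x_i) \propto \exp(-\tau\, \mathrm{dist}^2(x_i, S_j))$ from a table of squared distances and evaluates $\sum_j \sum_i c_j^2(x_i)$. The function names and the toy distances are assumptions; the squared distances are taken as given, since how they are computed depends on the clustering method.

```python
import math
from typing import List

Matrix = List[List[float]]


def soft_likelihoods(sq_dists: Matrix, tau: float) -> Matrix:
    """sq_dists[i][j] = dist^2(x_i, S_j); returns c[i][j], normalized over clusters j."""
    out = []
    for row in sq_dists:
        d_min = min(row)
        # Shifting by the smallest distance leaves the ratios unchanged
        # and avoids underflow of all terms for large (finite) tau.
        w = [math.exp(-tau * (d - d_min)) for d in row]
        s = sum(w)
        out.append([v / s for v in w])
    return out


def complexity_term(c: Matrix) -> float:
    """Sum of squared likelihoods over all samples and clusters."""
    return sum(v * v for row in c for v in row)


# Toy squared distances (hypothetical): n = 2 samples, l = 3 clusters.
sq_dists = [[0.0, 1.0, 4.0], [2.0, 0.5, 3.0]]
term_uniform = complexity_term(soft_likelihoods(sq_dists, tau=0.0))
term_hard = complexity_term(soft_likelihoods(sq_dists, tau=1000.0))
```

In this toy example the term moves from n/l = 2/3 at $\tau = 0$ towards n = 2 for large $\tau$, matching the two special cases discussed above.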


5 Empirical Analysis and Applications 


5.1 Experimental Setup 

We implement the proposed convex localized MKL (CLMKL) algorithm in MATLAB and solve the involved
canonical SVM problem with LIBSVM [8]. The clusters $\{S_1, \ldots, S_l\}$ are computed through
kernel k-means [e.g., ?], but in principle other clustering methods (including convex ones such as
Hocking et al. [?]) could be used. To further diminish k-means' potential fluctuations (which are
due to random initialization of the cluster means), we repeat kernel k-means t times, and choose
the run with minimal clustering error (the sum of the squared distances between the examples and
their nearest cluster) as the final partition $\{S_1, \ldots, S_l\}$. To tune the parameter $\tau$
in (11) in a uniform manner, we introduce the notation


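\[ \mathrm{AE}(\tau) := \frac{1}{nl} \sum_{i \in \mathbb{N}_n} \sum_{j \in \mathbb{N}_l} \frac{\exp(-\tau\, \mathrm{dist}^2(x_i, S_j))}{\max_{\tilde{j} \in \mathbb{N}_l} \exp(-\tau\, \mathrm{dist}^2(x_i, S_{\tilde{j}}))} \]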


to measure the average evenness (or average excess over hard partition) of the likelihood function.
It can be checked that $\mathrm{AE}(\tau)$ is a strictly decreasing function of $\tau$, taking value
1 at the point $\tau = 0$ and $l^{-1}$ at the point $\tau = \infty$. Instead of tuning the parameter
$\tau$ directly, we propose to tune the average excess/evenness over a subset of $[l^{-1}, 1]$. The
associated parameter $\tau$ is then found by the standard binary search algorithm.
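A binary search for $\tau$ given a target evenness can be sketched as follows. The snippet assumes that AE($\tau$) is the average, over samples and clusters, of the ratio $\exp(-\tau\, \mathrm{dist}^2(x_i, S_j)) / \max_{\tilde{j}} \exp(-\tau\, \mathrm{dist}^2(x_i, S_{\tilde{j}}))$, which equals 1 at $\tau = 0$ and tends to $1/l$ when every sample has a unique nearest cluster. This form, the function names, the tolerance and the bracketing cap are assumptions for illustration.

```python
import math
from typing import List

Matrix = List[List[float]]


def average_evenness(sq_dists: Matrix, tau: float) -> float:
    """Assumed AE(tau); sq_dists[i][j] = dist^2(x_i, S_j)."""
    n, l = len(sq_dists), len(sq_dists[0])
    total = 0.0
    for row in sq_dists:
        d_min = min(row)  # dividing by the maximal exponential equals shifting by d_min
        total += sum(math.exp(-tau * (d - d_min)) for d in row)
    return total / (n * l)


def tau_for_evenness(sq_dists: Matrix, target: float,
                     tol: float = 1e-10, max_iter: int = 200) -> float:
    """Find tau with AE(tau) close to target by bisection (AE is decreasing in tau)."""
    l = len(sq_dists[0])
    if not (1.0 / l < target < 1.0):
        raise ValueError("target must lie strictly between 1/l and 1")
    lo, hi = 0.0, 1.0
    while average_evenness(sq_dists, hi) > target:  # grow the bracket until AE(hi) <= target
        hi *= 2.0
        if hi > 1e12:
            raise ValueError("target not reachable (ties among nearest clusters?)")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if average_evenness(sq_dists, mid) > target:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


# Toy squared distances (hypothetical): n = 2 samples, l = 3 clusters.
sq_dists = [[0.0, 1.0, 4.0], [2.0, 0.5, 3.0]]
tau_half = tau_for_evenness(sq_dists, target=0.5)
```

Tuning the evenness instead of $\tau$ itself makes the grid of candidate values comparable across datasets, since the scale of the distances no longer matters.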

Figure 1: Results of the gene finding experiments: (a) splice site detection (left) and (b)
transcription start site (TSS) detection (right). To keep the presentation uncluttered, results for
UNIF are not shown. The parameter p for CLMKL, HLMKL and MKL is set to 1.

We compare the performance attained by the proposed CLMKL to regular localized MKL (LMKL) [14],
localized MKL based on hard clustering (HLMKL) [34], the SVM using a uniform kernel combination
(UNIF) [9], and $\ell_p$-norm MKL [31], which includes classical MKL [32] as a special case. We
optimize $\ell_p$-norm MKL and CLMKL until the relative duality gap drops below 0.001. The
calculation of the gradients in LMKL [14] requires $O(n^2 M^2 d)$ operations, which scales poorly,
and the definition of the gating model requires the information of primitive features, which is not
available for the biological applications studied below, all of which involve string kernels. In
the Supplemental Material, we therefore give a fast and general formulation of LMKL, which requires
only $O(n^2 M)$ operations per iteration. Our implementation of this formulation is available from
the following webpage, together with our CLMKL implementation and scripts to reproduce the
experiments:

https://www.dropbox.com/sh/hkkfa0ghxzuig03/AADRdtSSdUSm8hfVbsdjcRqva?dl=0 

In the following we report detailed results for various real-world experiments. Further details are
given in the Supplemental Material.

5.2 Splice Site Recognition 

Our first experiment aims at detecting splice sites in the organism Caenorhabditis elegans, which
is an important task in computational gene finding, as splice sites are located on the DNA strand
right at the boundary of exons (which code for proteins) and introns (which do not). We experiment
on the mkl-splice data set, which we download from
http://mldata.org/repository/data/viewslug/mkl-splice/. It includes 1000 splice site instances and
20 weighted-degree kernels with degrees ranging from 1 to 20 [?]. The experimental setup for this
experiment is as follows. We create random splits of this dataset into training set, validation set
and test set, with the size of the training set traversing the set {50, 100, 200, 300, ..., 800}.
We apply kernel k-means with the uniform kernel to generate a partition with l = 3 clusters for
both CLMKL and HLMKL, and use this kernel to define the gating model in LMKL. To be consistent with
previous studies, we use the area under the ROC curve (AUC) as an evaluation criterion. We tune the
SVM regularization parameter and the average evenness, the latter over the interval [0.4, 0.8] with
eight linearly spaced points, based on the AUCs on the validation set. All the base kernel matrices
are multiplicatively normalized before training. We repeat the experiment 50 times, and report mean
AUCs on the test set as well as standard deviations. Figure 1(a) shows the results as a function of
the training set size n.

We observe that CLMKL achieves, for all n, a significant gain over all baselines. This improvement
is especially strong for small n. For n = 50, CLMKL attains an AUC of 90.9%, while the best baseline
only achieves 85.4%, an improvement of 5.5 percentage points. Detailed results with standard
deviations are reported in Table 1. A hypothetical explanation of the improvement from CLMKL is
that splice sites are characterized by nucleotide sequences (so-called motifs), the length of which
may differ from site to site [?]. The 20 employed kernels count matching subsequences of length 1
to 20, respectively. For sites characterized by smaller motifs, low-degree WD-kernels are thus more
effective than high-degree ones, and vice versa for sites containing longer motifs.



n              50         100        200        300        400        500        600        700        800
UNIF           79.5±2.8•  84.2±2.2•  88.0±1.7•  90.0±1.7•  91.6±1.5•  92.4±1.5•  93.3±1.7•  93.6±1.7•  93.8±2.3•
LMKL           79.8±2.7•  84.2±2.3•  88.4±1.7•  90.5±1.7•  91.9±1.5•  92.8±1.5•  93.7±1.6•  94.1±1.7•  94.3±2.2•
MKL, p=1       80.2±2.8•  85.2±2.0•  89.2±1.6•  91.1±1.6•  92.5±1.5•  93.1±1.5•  93.9±1.4•  94.0±1.6•  94.2±2.1•
MKL, p=2       79.6±2.8•  84.3±2.2•  88.3±1.7•  90.4±1.6•  91.8±1.5•  92.5±1.5•  93.4±1.6•  93.6±1.6•  93.8±2.2•
MKL, p=1.33    79.7±2.9•  84.6±2.1•  88.6±1.7•  90.6±1.6•  92.0±1.5•  92.7±1.5•  93.5±1.5•  93.7±1.6•  93.8±2.1•
HLMKL, p=1     84.9±2.0•  87.7±1.8•  90.4±1.6•  91.5±1.4•  93.0±1.3•  92.9±1.6•  93.9±1.5•  94.3±1.6•  95.0±2.0•
HLMKL, p=2     84.9±2.0•  87.0±1.7•  90.4±1.4•  91.1±1.6•  92.6±1.4•  93.5±1.6•  94.7±1.4•  94.6±1.4•  94.4±2.2•
HLMKL, p=1.33  85.4±1.9•  88.5±1.7•  90.1±1.6•  91.7±1.4•  92.7±1.2•  93.4±1.6•  94.6±1.5•  94.4±1.7•  94.4±2.1•
CLMKL, p=1     90.9±1.6   91.3±1.4•  93.3±1.2•  93.8±1.2•  94.3±1.0•  94.8±1.2   95.3±1.3•  95.1±1.4•  95.2±2.0•
CLMKL, p=2     90.5±1.6•  92.3±1.2•  93.0±1.2•  94.0±1.2   94.4±1.1•  94.7±1.2•  95.4±1.4•  95.3±1.5•  95.6±1.9•
CLMKL, p=1.33  90.9±1.5   90.1±1.3   92.7±1.2   94.1±1.2   94.8±1.1   94.9±1.1   95.0±1.2   95.4±1.5   95.4±1.9

Table 1: Performances achieved by LMKL, UNIF, regular $\ell_p$-norm MKL, HLMKL, and CLMKL on the
Splice dataset, as a function of the training set size n. • indicates that CLMKL with p = 1.33 is
significantly better than the compared method (paired t-tests at 95% significance level).


5.3 Transcription Start Site Detection 

Our next experiment aims at detecting transcription start sites (TSS) of RNA Polymerase II binding
genes in genomic DNA sequences. We experiment on the TSS data set, which we downloaded from
http://mldata.org/repository/data/viewslug/tss/. This data set, which is included in the larger
study of [46], comes with 5 kernels. The SVM based on the uniform combination of these 5 kernels
was found to have the highest overall performance among 19 promoter prediction programs [?]. It
therefore constitutes a strong baseline. To be consistent with previous studies [?], we use the
area under the ROC curve (AUC) as an evaluation criterion. We consider the same experimental setup
as in the splice detection experiment. The gating function and the partition are computed with the
TSS kernel, which carries most of the discriminative information [46]. All kernel matrices were
normalized with respect to their trace prior to the experiment.
UNIF           83.9±2.4•  86.2±1.3•  87.6±1.0•  88.4±0.9•  88.7±0.9•  89.1±0.9•  89.2±1.0•  89.0±1.1•  89.8±1.1•
LMKL           85.2±1.2•  85.9±1.1•  80.6±1.1•  87.1±1.0•  87.2±0.9•  87.3±1.0•  87.5±1.0•  88.1±1.1•  88.7±1.3•
MKL, p=1       86.0±1.7•  87.7±1.0•  88.9±0.9•  89.0±0.9•  90.0±0.9•  90.3±0.9•  90.5±0.9   91.0±0.9   91.2±0.9
MKL, p=2       85.1±2.0•  86.9±1.1•  88.1±0.9•  88.8±0.9•  89.2±0.9•  89.0±0.9•  89.8±1.0•  90.3±1.0•  90.7±0.9•
MKL, p=1.33    85.7±1.8•  87.5±1.0•  88.7±0.9•  89.4±0.9•  89.8±0.9•  90.2±0.9•  90.4±0.9•  90.9±0.9•  91.2±0.9•
HLMKL, p=1     86.8±1.2•  87.8±1.0•  88.7±0.9•  89.4±0.9•  89.8±1.0•  90.0±1.0•  90.4±1.0•  90.7±1.0•  91.0±1.0•
HLMKL, p=2     86.3±1.4•  87.5±1.0•  88.5±0.9•  89.3±0.9•  89.4±0.9•  89.7±0.9•  89.8±1.0•  90.3±1.1•  90.5±1.0•
HLMKL, p=1.33  86.5±1.4•  87.7±1.1•  88.7±0.9•  89.3±0.9•  89.8±1.0•  90.1±0.9•  90.2±1.0•  90.7±1.0•  91.0±0.9•
CLMKL, p=1     87.0±1.2   88.5±1.0   89.4±0.8   90.0±0.9   90.3±0.9•  90.6±0.9   90.8±0.9•  91.2±0.9•  91.4±0.9
CLMKL, p=2     87.3±1.3•  88.3±1.0•  89.1±0.8•  89.0±0.8•  89.9±0.9•  90.2±0.9•  90.3±0.9•  90.7±1.0•  90.9±0.9•
CLMKL, p=1.33  87.0±1.2   88.6±0.9   89.4±0.8   89.9±0.9   90.2±0.9   90.5±0.9   90.0±1.0   91.1±1.0   91.3±0.9

Table 2: Performances achieved by LMKL, UNIF, regular $\ell_p$-norm MKL, HLMKL and CLMKL on the TSS
dataset. • indicates that CLMKL with p = 1.33 is significantly better than the compared method
(paired t-tests at 95% significance level).


Figure 1(b) shows the AUCs on the test data sets as a function of the number of training examples.
We observe that CLMKL attains a consistent improvement over other competing methods. Again, this
improvement is most significant when n is small. Detailed results with standard deviations are
reported in Table 2.
5.4 Protein Fold Prediction


Protein fold prediction is a key step towards understanding the function of proteins, as the
folding class of a protein is closely linked with its function; thus it is crucial for drug design.
We experiment on the protein folding class prediction dataset by Ding and Dubchak [?], which was
also used in Campbell and Ying [7], Kloft [24], and Kloft and Blanchard [25]. This dataset consists
of 27 fold classes with 311 proteins used for training and 383 proteins for testing. We use exactly
the same 12 kernels as in Campbell and Ying [7], Kloft [24], and Kloft and Blanchard [25],
reflecting different features, such as van der Waals volume, polarity and hydrophobicity. We
precisely replicate the experimental setup of previous experiments by Campbell and Ying [7], Kloft
[24], and Kloft and Blanchard [25], which is detailed in Supplemental Material E.1. We report the
mean prediction accuracies, as well as standard deviations, in Table 3.

The results show that CLMKL surpasses regular $\ell_p$-norm MKL for all values of p, and achieves
accuracies up to 0.6% higher than the one reported in Kloft [24], which is higher than the
initially reported accuracies in Campbell and Ying [7]. LMKL works poorly on this dataset, possibly
because LMKL based on precomputed custom kernels requires optimizing nM additional variables, which
may overfit.


5.5 Visual Image Categorization—UIUC Sports 

We experiment on the UIUC Sports event dataset [35] consisting of 1574 images, belonging to 8 image
classes of sports activities. We compute 9 $\chi^2$-kernels based on SIFT features and global color
histograms, which are described in detail in Supplemental Material E.2, where we also give
background on the experimental setup.

From the results shown in Table 4, we observe that CLMKL achieves a performance improvement of
0.26% over the $\ell_p$-norm MKL baseline, while localized MKL as in Gönen and Alpaydin [14]
underperforms the MKL baseline.


5.6 Execution Time Experiments 


To demonstrate the efficiency of the proposed implementation, we compare the training time for UNIF,
LMKL, $\ell_p$-norm MKL, HLMKL and CLMKL on the TSS dataset. We fix the regularization parameter
C = 1. We fix l = 3 and AE = 0.5 for CLMKL, and fix l = 3 for HLMKL. The accompanying figure plots the
training time versus the training set size. We repeat the experiment 20 times and report the average training
time. We optimize CLMKL, HLMKL and MKL until the relative gap is under 10“^. The figure shows that
CLMKL converges faster than LMKL. Furthermore, training an $\ell_2$-norm MKL requires significantly less
time than training an $\ell_1$-norm MKL, which is consistent with the fact that the dual problem of $\ell_2$-norm
MKL is much smoother than its $\ell_1$-norm counterpart.



[Figure: training time versus number of training examples.]



Method   p     ACC
UNIF     -     68.4 •
LMKL     -     64.3 •
MKL      1     68.7 •
MKL      1.2   74.2 •
MKL      2     70.8 •
HLMKL    1     72.7 ± 1.3 •
HLMKL    1.2   74.6 ± 0.6
HLMKL    2     72.4 ± 0.8 •
CLMKL    1     71.3 ± 0.5 •
CLMKL    1.2   75.0 ± 0.7
CLMKL    2     71.7 ± 0.5 •

Table 3: Results of the protein fold prediction experiment. • indicates that CLMKL with p = 1.2 is significantly better 
than the compared method (paired t-tests at 95% significance level). 
        MKL          LMKL         CLMKL
ACC     90.00        87.29        90.26
Δ       0+ 11= 0−    0+ 1= 10−    4+ 6= 1−

Table 4: Results of the visual image recognition experiment on the UIUC sports dataset. Δ indicates on how many outer
cross validation test splits a method is worse (n−), equal (n=) or better (n+) than MKL.
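
The Δ row of Table 4 can be computed from per-split accuracies with a simple counting routine. The following is a minimal sketch; the function name `compare_to_baseline`, the tolerance `tol`, and the example accuracies are hypothetical and are not taken from the experiments.

```python
def compare_to_baseline(method_acc, baseline_acc, tol=1e-12):
    """Count the splits on which a method is better (n+), equal (n=) or worse (n-)
    than the baseline. Both inputs hold one accuracy per outer test split."""
    better = equal = worse = 0
    for a, b in zip(method_acc, baseline_acc):
        if abs(a - b) <= tol:
            equal += 1
        elif a > b:
            better += 1
        else:
            worse += 1
    return better, equal, worse


# Hypothetical per-split accuracies (illustration only, not the paper's numbers).
mkl = [0.90, 0.88, 0.91, 0.89]
clmkl = [0.92, 0.88, 0.90, 0.93]
counts = compare_to_baseline(clmkl, mkl)
print(counts)  # (2, 1, 1): better on 2 splits, equal on 1, worse on 1
```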

6 Conclusions 

Localized approaches to multiple kernel learning allow for flexible distribution of kernel weights over the 
input space, which can be a great advantage when samples require varying kernel importance. As we show 
in this paper, this can be the case in image recognition and several computational biology applications. 
However, almost all prevalent approaches to localized MKL require solving difficult non-convex optimization
problems, which makes them potentially prone to overfitting, as theoretical guarantees such as generalization
error bounds are yet unknown.

In this paper, we propose a theoretically grounded approach to localized MKL, consisting of two
subsequent steps: 1. clustering the training instances and 2. computation of the kernel weights for each cluster
through a single convex optimization problem, for which we derive an efficient optimization algorithm
based on Fenchel duality. Using Rademacher complexity theory, we establish large-deviation inequalities
for localized MKL, showing that the smoothness in the cluster membership assignments crucially controls 
the generalization error. The proposed method is well suited for deployment in the domains of computer 
vision and computational biology. For splice site detection, CLMKL achieves up to 5% higher accuracy 
than its global and non-convex localized counterparts. 

Future work could analyze extension of the methodology to semi-supervised learning iiniiia or using 
different clustering objectives [21, 52], and how to include, in a principled way, the construction of the data
partition into our framework by constructing partitions that can capture the local variation of prediction
importance of different features.
Supplemental Material

A Lemmata and Proofs 

A.l Lemmata Used for Dualization in Section |23] 

Let $H_1, \ldots, H_M$ be $M$ Hilbert spaces and $p > 1$. Define the function $g_p : H_1 \times \cdots \times H_M \to \mathbb{R}$ by

\[ g_p(v_1, \ldots, v_M) = \frac{1}{2} \|(v_1, \ldots, v_M)\|_{2,p}^2, \quad p > 1. \]

For any $p > 1$, denote by $p^*$ the conjugated exponent satisfying $\frac{1}{p} + \frac{1}{p^*} = 1$.















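Lemma A.1. The gradient of $g_p$ is

\[ \frac{\partial g_p(v_1, \ldots, v_M)}{\partial v_m} = \Big[ \sum_{\tilde{m} \in \mathbb{N}_M} \|v_{\tilde{m}}\|_2^p \Big]^{\frac{2}{p} - 1} \|v_m\|_2^{p-2} v_m. \]

Proof. By the chain rule, we have

\[
\begin{aligned}
\frac{\partial g_p(v_1, \ldots, v_M)}{\partial v_m}
&= \frac{1}{p} \Big[ \sum_{\tilde{m} \in \mathbb{N}_M} \|v_{\tilde{m}}\|_2^p \Big]^{\frac{2}{p} - 1} \frac{\partial \langle v_m, v_m \rangle^{\frac{p}{2}}}{\partial v_m} \\
&= \frac{1}{2} \Big[ \sum_{\tilde{m} \in \mathbb{N}_M} \|v_{\tilde{m}}\|_2^p \Big]^{\frac{2}{p} - 1} \langle v_m, v_m \rangle^{\frac{p}{2} - 1} \frac{\partial \langle v_m, v_m \rangle}{\partial v_m} \\
&= \Big[ \sum_{\tilde{m} \in \mathbb{N}_M} \|v_{\tilde{m}}\|_2^p \Big]^{\frac{2}{p} - 1} \|v_m\|_2^{p-2} v_m.
\end{aligned}
\]

□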
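
The gradient of $g_p$, namely $[\sum_{\tilde{m}} \|v_{\tilde{m}}\|_2^p]^{2/p-1} \|v_m\|_2^{p-2} v_m$, can be sanity-checked numerically. The sketch below is an illustration only: it assumes finite-dimensional blocks represented as NumPy vectors, and the helper names `g_p`, `grad_g_p` and `numeric_grad` are hypothetical.

```python
import numpy as np


def g_p(blocks, p):
    """g_p(v_1..v_M) = 0.5 * ||(v_1..v_M)||_{2,p}^2 for a list of vectors."""
    norms = np.array([np.linalg.norm(v) for v in blocks])
    return 0.5 * np.sum(norms ** p) ** (2.0 / p)


def grad_g_p(blocks, p):
    """Closed-form gradient, one array per block."""
    norms = np.array([np.linalg.norm(v) for v in blocks])
    outer = np.sum(norms ** p) ** (2.0 / p - 1.0)
    return [outer * n ** (p - 2.0) * v for n, v in zip(norms, blocks)]


def numeric_grad(blocks, p, eps=1e-6):
    """Central finite differences of g_p, coordinate by coordinate."""
    grads = []
    for m, v in enumerate(blocks):
        g = np.zeros_like(v)
        for k in range(v.size):
            plus = [b.copy() for b in blocks]
            minus = [b.copy() for b in blocks]
            plus[m][k] += eps
            minus[m][k] -= eps
            g[k] = (g_p(plus, p) - g_p(minus, p)) / (2 * eps)
        grads.append(g)
    return grads


rng = np.random.default_rng(0)
blocks = [rng.normal(size=d) for d in (3, 2, 4)]  # three blocks of assumed sizes
p = 1.5
max_err = max(
    np.max(np.abs(a - b))
    for a, b in zip(grad_g_p(blocks, p), numeric_grad(blocks, p))
)
print(max_err)  # small: closed form and finite differences agree
```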


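Lemma A.2 (Micchelli and Pontil [40]). Let $a_i \ge 0$, $i \in \mathbb{N}_d$ and $1 \le r < \infty$. Then

\[ \min_{\eta : \eta_i \ge 0,\ \sum_{i \in \mathbb{N}_d} \eta_i^r \le 1} \ \sum_{i \in \mathbb{N}_d} \frac{a_i}{\eta_i} = \Big( \sum_{i \in \mathbb{N}_d} a_i^{\frac{r}{r+1}} \Big)^{1 + \frac{1}{r}} \]

and the minimum is attained at $\eta_i = a_i^{\frac{1}{r+1}} \big( \sum_{k \in \mathbb{N}_d} a_k^{\frac{r}{r+1}} \big)^{-\frac{1}{r}}$.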

A.2 Proof of Representer Theorem (Theorem]^ 

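Proof of the representer theorem. In our derivation of the dual problem, the variable $w_j(\alpha) := (w_j^{(1)}(\alpha), \ldots, w_j^{(M)}(\alpha))$
should meet the optimality in the sense

\[ w_j(\alpha) = \arg\max_{v_1 \in H_1, \ldots, v_M \in H_M} \Big[ \Big\langle (v_1, \ldots, v_M), \Big( \sum_{i \in \mathbb{N}_n} \alpha_i c_j(x_i) \phi_m(x_i) \Big)_{m=1}^M \Big\rangle - g_{\bar{p}}(v_1, \ldots, v_M) \Big], \]

where $\bar{p} = \frac{2p}{p+1}$. Since $(\nabla f)^{-1} = \nabla f^*$ for any convex function $f$ and the Fenchel-conjugate of $g_p$ is $g_{p^*}$, we obtain

\[
\begin{aligned}
w_j(\alpha) &= (\nabla g_{\bar{p}})^{-1} \Big( \sum_{i \in \mathbb{N}_n} \alpha_i c_j(x_i) \phi_1(x_i), \ldots, \sum_{i \in \mathbb{N}_n} \alpha_i c_j(x_i) \phi_M(x_i) \Big) \\
&= \nabla g_{\bar{p}^*} \Big( \sum_{i \in \mathbb{N}_n} \alpha_i c_j(x_i) \phi_1(x_i), \ldots, \sum_{i \in \mathbb{N}_n} \alpha_i c_j(x_i) \phi_M(x_i) \Big) \\
&= \Big[ \sum_{\tilde{m} \in \mathbb{N}_M} \Big\| \sum_{i \in \mathbb{N}_n} \alpha_i c_j(x_i) \phi_{\tilde{m}}(x_i) \Big\|_2^{\bar{p}^*} \Big]^{\frac{2}{\bar{p}^*} - 1} \Big( \Big\| \sum_{i \in \mathbb{N}_n} \alpha_i c_j(x_i) \phi_1(x_i) \Big\|_2^{\bar{p}^* - 2} \Big[ \sum_{i \in \mathbb{N}_n} \alpha_i c_j(x_i) \phi_1(x_i) \Big], \\
&\qquad \ldots, \Big\| \sum_{i \in \mathbb{N}_n} \alpha_i c_j(x_i) \phi_M(x_i) \Big\|_2^{\bar{p}^* - 2} \Big[ \sum_{i \in \mathbb{N}_n} \alpha_i c_j(x_i) \phi_M(x_i) \Big] \Big).
\end{aligned}
\]

Note that the above derivation uses Lemma A.1 from Supplemental Material A.1. □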


A.3 Proof of Proposition 

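Proof of the proposition. Fixing the variables $w$ and $b$, the optimization problem reduces to

\[ \min_{\beta} \sum_{j \in \mathbb{N}_l} \sum_{m \in \mathbb{N}_M} \frac{\|w_j^{(m)}\|_2^2}{2 \beta_{jm}} \quad \text{s.t.} \quad \sum_{m \in \mathbb{N}_M} \beta_{jm}^p \le 1, \ \beta_{jm} \ge 0, \ \forall j \in \mathbb{N}_l, m \in \mathbb{N}_M. \]

This problem can be decomposed into $l$ independent subproblems, one at each locality. For example, the
subproblem at the $j$-th locality is as follows:

\[ \min_{\beta_j} \sum_{m \in \mathbb{N}_M} \frac{\|w_j^{(m)}\|_2^2}{2 \beta_{jm}} \quad \text{s.t.} \quad \sum_{m \in \mathbb{N}_M} \beta_{jm}^p \le 1, \ \beta_{jm} \ge 0, \ \forall m \in \mathbb{N}_M. \]

Applying Lemma A.2 with $a_m = \|w_j^{(m)}\|_2^2$, $\eta_m = \beta_{jm}$ and $r = p$ completes the proof.

□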
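
The closed-form minimizer obtained from the Micchelli and Pontil lemma with $a_m = \|w_j^{(m)}\|_2^2$ and $r = p$, that is $\beta_{jm} = a_m^{1/(p+1)} (\sum_k a_k^{p/(p+1)})^{-1/p}$, can be written in a few lines. This is a minimal sketch rather than the authors' implementation; the names `kernel_weight_update`, `objective` and the random block norms are assumptions made for illustration.

```python
import numpy as np


def kernel_weight_update(block_norms, p):
    """block_norms[j, m] = ||w_j^(m)||_2 (assumed strictly positive here).
    Returns beta[j, m] minimizing sum_m ||w_j^(m)||^2 / (2 beta_jm)
    subject to sum_m beta_jm^p <= 1, beta_jm >= 0, separately for every j."""
    a = np.asarray(block_norms, dtype=float) ** 2        # a_m = ||w_j^(m)||_2^2
    num = a ** (1.0 / (p + 1.0))
    den = np.sum(a ** (p / (p + 1.0)), axis=1, keepdims=True) ** (1.0 / p)
    return num / den


def objective(block_norms, beta):
    """Regularizer value sum_j sum_m ||w_j^(m)||^2 / (2 beta_jm)."""
    return np.sum(np.asarray(block_norms, dtype=float) ** 2 / (2.0 * beta))


rng = np.random.default_rng(0)
norms = rng.uniform(0.1, 2.0, size=(3, 4))   # hypothetical: l = 3 clusters, M = 4 kernels
p = 1.5
beta = kernel_weight_update(norms, p)
opt = objective(norms, beta)
print(np.sum(beta ** p, axis=1))             # each row lies on the l_p unit sphere
```

Because the optimal weights only depend on ratios of the $a_m$, the factor $\frac{1}{2}$ in the objective does not change the minimizer.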


A.4 Proof of Theorem 1^ Convergence of the CLMKL Optimization Algorithm 

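The following lemma is a direct consequence of Lemma 3.1 and Theorem 4.1 in Tseng [50].

Lemma A.3. Let $f : \mathbb{R}^{d_1 + \cdots + d_R} \to \mathbb{R} \cup \{\infty\}$ be a function. Put $d = d_1 + \cdots + d_R$. Suppose that $f$
can be decomposed into $f(\alpha_1, \ldots, \alpha_R) = f_0(\alpha_1, \ldots, \alpha_R) + \sum_{r \in \mathbb{N}_R} f_r(\alpha_r)$ for some $f_0 : \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$ and
$f_r : \mathbb{R}^{d_r} \to \mathbb{R} \cup \{\infty\}$, $r \in \mathbb{N}_R$. Initialize the block coordinate descent method by $\alpha^0 = (\alpha_1^0, \ldots, \alpha_R^0)$. Define the iterates
$\alpha^k = (\alpha_1^k, \ldots, \alpha_R^k)$ by

\[ \alpha_r^k = \arg\min_{u \in \mathbb{R}^{d_r}} f(\alpha_1^k, \ldots, \alpha_{r-1}^k, u, \alpha_{r+1}^{k-1}, \ldots, \alpha_R^{k-1}), \quad \forall r \in \mathbb{N}_R, k \in \mathbb{N}_+. \quad (A.1) \]

Assume that

(A1) $f$ is convex and proper, i.e., $f \not\equiv \infty$;

(A2) the sublevel set $A^0 := \{\alpha \in \mathbb{R}^d : f(\alpha) \le f(\alpha^0)\}$ is compact and $f$ is continuous on $A^0$;

(A3) $\mathrm{dom}(f_0) := \{\alpha \in \mathbb{R}^d : f_0(\alpha) < \infty\}$ is open and $f_0$ is continuously differentiable on $\mathrm{dom}(f_0)$.

Then, the minimizer in (A.1) exists and any limit point of the sequence $(\alpha^k)_{k \in \mathbb{N}_+}$ minimizes $f$ over $A^0$.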

The proof of Theorem]^ is a direct consequence of the above lemma. 

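Proof of the convergence theorem. The primal problem can be rewritten as follows:

\[ \inf_{w,\ \beta_j \in \Theta_p,\ j \in \mathbb{N}_l} \ \sum_{j \in \mathbb{N}_l} \sum_{m \in \mathbb{N}_M} \frac{\|w_j^{(m)}\|_2^2}{2 \beta_{jm}} + C \sum_{i \in \mathbb{N}_n} \ell\Big( \sum_{j \in \mathbb{N}_l} c_j(x_i) \sum_{m \in \mathbb{N}_M} \langle w_j^{(m)}, \phi_m(x_i) \rangle, y_i \Big), \quad (A.2) \]

where $\Theta_p = \{(\theta_1, \ldots, \theta_M) \in \mathbb{R}^M : \theta_m \ge 0, \|(\theta_m)_{m=1}^M\|_p \le 1\}$.

Note that Eq. (A.2) can be written as the following unconstrained problem:

\[ \inf_{w, \beta} f(w, \beta), \quad \text{where } f(w, \beta) = f_0(w, \beta) + f_1(w) + f_2(\beta), \]

with

\[ f_0(w, \beta) = \sum_{j \in \mathbb{N}_l} \sum_{m \in \mathbb{N}_M} \Big[ \frac{\|w_j^{(m)}\|_2^2}{2 \beta_{jm}} + I_{\beta_{jm} \ge 0}(\beta_{jm}) \Big], \]

\[ f_1(w) = C \sum_{i \in \mathbb{N}_n} \ell\Big( \sum_{j \in \mathbb{N}_l} c_j(x_i) \sum_{m \in \mathbb{N}_M} \langle w_j^{(m)}, \phi_m(x_i) \rangle, y_i \Big) \quad \text{and} \quad f_2(\beta) = \sum_{j \in \mathbb{N}_l} I_{\|\beta_j\|_p \le 1}(\beta_j). \]

Here $I$ is the indicator function, i.e., $I_S(s) = 0$ if $s \in S$ and $\infty$ otherwise.

Now, it remains to check the assumptions (A1), (A2) and (A3) in Lemma A.3 for Algorithm 1.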
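Validity of A1. It is known that a quadratic over a linear function is convex, so the term $\sum_{m \in \mathbb{N}_M} \frac{\|w_j^{(m)}\|_2^2}{2 \beta_{jm}}$
is convex. Also, since $\ell$ is convex w.r.t. the first argument and the term $\sum_{j \in \mathbb{N}_l} c_j(x_i) \sum_{m \in \mathbb{N}_M} \langle w_j^{(m)}, \phi_m(x_i) \rangle$
is a linear function of $w$, we immediately know the term $\sum_{i \in \mathbb{N}_n} \ell\big( \sum_{j \in \mathbb{N}_l} c_j(x_i) \sum_{m \in \mathbb{N}_M} \langle w_j^{(m)}, \phi_m(x_i) \rangle, y_i \big)$
is convex. The convexity of $f$ follows immediately. For the initial assignment $w^0 = (w_j^{(m),0})_{jm}$ with
$w_j^{(m),0} = 0$ and the initial kernel weights $\beta^0 = (\beta_{jm}^0)_{jm}$, we know

\[ f(w^0, \beta^0) = C \sum_{i \in \mathbb{N}_n} \ell(0, y_i) \le C n \sup_{y \in \mathcal{Y}} \ell(0, y) < \infty \]

and therefore $f$ is proper.

Validity of A2. Recall that our algorithm is initialized with $w^0 = 0$ and the initial kernel weights $\beta^0$. For any
$(w, \beta) \in A^0 := \{(w, \beta) : f(w, \beta) \le C n \sup_{y \in \mathcal{Y}} \ell(0, y)\}$, we have

\[ \|w_j^{(m)}\|_2^2 \le 2 \beta_{jm} C n \sup_{y \in \mathcal{Y}} \ell(0, y) \le 2 C n \sup_{y \in \mathcal{Y}} \ell(0, y), \]

which, coupled with the constraint $\|(\beta_{jm})_{m=1}^M\|_p \le 1$, $\forall j \in \mathbb{N}_l$, immediately shows that $A^0$ is bounded.
Furthermore, since $f_0$, $f_1$ and $f_2$ are continuous on their respective domains, the function $f$ is therefore
continuous on $A^0 \subset \mathrm{dom}(f_0) \cap \mathrm{dom}(f_1) \cap \mathrm{dom}(f_2)$. It is also known that the preimage of a closed set is
closed under a continuous function, from which we know the set $A^0 = f^{-1}\big( (-\infty, C n \sup_{y \in \mathcal{Y}} \ell(0, y)] \big)$ is closed.

Any closed and bounded subset in $\mathbb{R}^d$, $d \in \mathbb{N}$, is compact and thus $A^0$ is compact.

Validity of A3. Clearly, $\mathrm{dom}(f_0) = \{(w, \beta) : \beta > 0\}$ is open and $f_0$ is continuously differentiable
on $\mathrm{dom}(f_0)$. □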

A.5 Proof of Generalization Error Bounds (Theorem 

In this section we present the proof of the achieved generalization error bounds (Theorem|^in the main text). 

Denote $\bar{p} = \frac{2p}{p+1}$ for any $p \ge 1$ and observe that $\bar{p} \le 2$, which implies $\bar{p}^* \ge 2$. To start with, we give a
discussion on the interpretation and tightness of the Rademacher complexity bounds.

Interpretation and Tightness of Rademacher complexity bounds. It can be directly checked that the
function $x \mapsto x M^{2/x}$ is decreasing along the interval $(0, 2 \log M)$ and increasing along the interval $(2 \log M, \infty)$.
Therefore, under the assumption $k_m(x, x) \le B$ the Rademacher complexity bounds satisfy the inequalities:

\[
R_n(H_{p,D}) \le
\begin{cases}
\frac{1}{n} \sqrt{2 e D B \log M \sum_{j \in \mathbb{N}_l} \sum_{i \in \mathbb{N}_n} c_j^2(x_i)}, & \text{if } p \le \frac{\log M}{\log M - 1}, \\
\frac{1}{n} \sqrt{\frac{2p}{p-1} D B M^{\frac{p-1}{p}} \sum_{j \in \mathbb{N}_l} \sum_{i \in \mathbb{N}_n} c_j^2(x_i)}, & \text{otherwise.}
\end{cases}
\]

In particular, the former expression can be taken for $p = 1$, resulting in a mild logarithmic dependence on the
number of kernels. Note that in the limiting case of just one cluster, i.e., $l = 1$, the Rademacher complexity
bounds match the result by Cortes et al. m, which was shown to be tight. 
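
The monotonicity of $x \mapsto x M^{2/x}$ used above follows from a one-line derivative computation. The worked derivation below is a sketch; the shorthand $h$ is introduced here for illustration and does not appear in the original text.

```latex
% h is a shorthand introduced for this derivation only
h(x) = x\,M^{2/x}, \qquad
\frac{d}{dx}\log h(x) = \frac{1}{x} - \frac{2\log M}{x^{2}} = \frac{x - 2\log M}{x^{2}},
% hence h'(x) < 0 for 0 < x < 2\log M and h'(x) > 0 for x > 2\log M; at the minimizer
h(2\log M) = 2\log M \cdot M^{1/\log M} = 2e\log M .
```

The value $2e \log M$ at the minimizer is where the logarithmic dependence on the number of kernels comes from.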

The proof of Theorem]^ is based on the following lemmata. 

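Lemma A.4 (Khintchine-Kahane inequality [23]). Let $v_1, \ldots, v_n \in H$. Then, for any $q \ge 1$, it holds

\[ \mathbb{E}_\sigma \Big\| \sum_{i \in \mathbb{N}_n} \sigma_i v_i \Big\|_2^q \le \Big( q \sum_{i \in \mathbb{N}_n} \|v_i\|_2^2 \Big)^{\frac{q}{2}}. \]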
Lemma A.5 (Block-structured Hölder inequality [26]). Let $x = (x^{(1)}, \ldots, x^{(n)})$, $y = (y^{(1)}, \ldots, y^{(n)}) \in H = H_1 \times \cdots \times H_n$. Then, for any $p \ge 1$, it holds

\[ \langle x, y \rangle \le \|x\|_{2,p} \|y\|_{2,p^*}. \]

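Proof of the Rademacher complexity bounds. Firstly, for any $1 \le t \le 2$ we can apply a block-structured version of Hölder inequality
to bound $\sum_{i \in \mathbb{N}_n} \sigma_i f_w(x_i)$ by

\[
\begin{aligned}
\sum_{i \in \mathbb{N}_n} \sigma_i f_w(x_i)
&= \sum_{i \in \mathbb{N}_n} \sigma_i \sum_{j \in \mathbb{N}_l} c_j(x_i) \sum_{m \in \mathbb{N}_M} \langle w_j^{(m)}, \phi_m(x_i) \rangle
 = \sum_{i \in \mathbb{N}_n} \sigma_i \sum_{j \in \mathbb{N}_l} c_j(x_i) \langle w_j, \phi(x_i) \rangle \\
&= \sum_{j \in \mathbb{N}_l} \Big\langle w_j, \sum_{i \in \mathbb{N}_n} \sigma_i c_j(x_i) \phi(x_i) \Big\rangle \\
&\overset{\text{Hölder}}{\le} \sum_{j \in \mathbb{N}_l} \|w_j\|_{2,t} \Big\| \sum_{i \in \mathbb{N}_n} \sigma_i c_j(x_i) \phi(x_i) \Big\|_{2,t^*} \\
&\overset{\text{C.-S.}}{\le} \Big[ \sum_{j \in \mathbb{N}_l} \|w_j\|_{2,t}^2 \Big]^{\frac{1}{2}} \Big[ \sum_{j \in \mathbb{N}_l} \Big\| \sum_{i \in \mathbb{N}_n} \sigma_i c_j(x_i) \phi(x_i) \Big\|_{2,t^*}^2 \Big]^{\frac{1}{2}}. \quad (A.3)
\end{aligned}
\]

For any $j \in \mathbb{N}_l$, the Khintchine-Kahane (K.-K.) inequality and Jensen inequality (since $t^* \ge 2$) permit us to
bound $\mathbb{E}_\sigma \big\| \sum_{i \in \mathbb{N}_n} \sigma_i c_j(x_i) \phi(x_i) \big\|_{2,t^*}^2$ by

\[
\begin{aligned}
\mathbb{E}_\sigma \Big\| \sum_{i \in \mathbb{N}_n} \sigma_i c_j(x_i) \phi(x_i) \Big\|_{2,t^*}^2
&= \mathbb{E}_\sigma \Big[ \sum_{m \in \mathbb{N}_M} \Big\| \sum_{i \in \mathbb{N}_n} \sigma_i c_j(x_i) \phi_m(x_i) \Big\|_2^{t^*} \Big]^{\frac{2}{t^*}} \\
&\overset{\text{Jensen}}{\le} \Big[ \mathbb{E}_\sigma \sum_{m \in \mathbb{N}_M} \Big\| \sum_{i \in \mathbb{N}_n} \sigma_i c_j(x_i) \phi_m(x_i) \Big\|_2^{t^*} \Big]^{\frac{2}{t^*}} \\
&\overset{\text{K.-K.}}{\le} \Big[ \sum_{m \in \mathbb{N}_M} \Big( t^* \sum_{i \in \mathbb{N}_n} c_j^2(x_i) \|\phi_m(x_i)\|_2^2 \Big)^{\frac{t^*}{2}} \Big]^{\frac{2}{t^*}} \\
&= t^* \Big[ \sum_{m \in \mathbb{N}_M} \Big( \sum_{i \in \mathbb{N}_n} c_j^2(x_i) k_m(x_i, x_i) \Big)^{\frac{t^*}{2}} \Big]^{\frac{2}{t^*}} \\
&= t^* \Big\| \Big( \sum_{i \in \mathbb{N}_n} c_j^2(x_i) k_m(x_i, x_i) \Big)_{m=1}^M \Big\|_{\frac{t^*}{2}}.
\end{aligned}
\]

Plugging the above inequalities into Eq. (A.3) and noticing the trivial inequality $\|w_j\|_{2,t} \le \|w_j\|_{2,\bar{p}}$, $\forall t \ge \bar{p} \ge 1$
(recall $\bar{p} = \frac{2p}{p+1}$), we get the following bound:

\[ R_n(H_{p,D}) \le \frac{\sqrt{D}}{n} \inf_{\bar{p} \le t \le 2} \Big( t^* \sum_{j \in \mathbb{N}_l} \Big\| \Big( \sum_{i \in \mathbb{N}_n} c_j^2(x_i) k_m(x_i, x_i) \Big)_{m=1}^M \Big\|_{\frac{t^*}{2}} \Big)^{\frac{1}{2}}. \]

Writing $t$ in place of $t^*$, the above inequality is equivalent to

\[ R_n(H_{p,D}) \le \frac{\sqrt{D}}{n} \inf_{2 \le t \le \bar{p}^*} \Big( t \sum_{j \in \mathbb{N}_l} \Big\| \Big( \sum_{i \in \mathbb{N}_n} c_j^2(x_i) k_m(x_i, x_i) \Big)_{m=1}^M \Big\|_{\frac{t}{2}} \Big)^{\frac{1}{2}}. \]

Under the condition $k_m(x, x) \le B$, the term in the brace of this bound can be controlled by

\[
\sum_{j \in \mathbb{N}_l} \Big\| \Big( \sum_{i \in \mathbb{N}_n} c_j^2(x_i) k_m(x_i, x_i) \Big)_{m=1}^M \Big\|_{\frac{t}{2}}
= \sum_{j \in \mathbb{N}_l} \Big[ \sum_{m \in \mathbb{N}_M} \Big( \sum_{i \in \mathbb{N}_n} c_j^2(x_i) k_m(x_i, x_i) \Big)^{\frac{t}{2}} \Big]^{\frac{2}{t}}
\le B M^{\frac{2}{t}} \sum_{j \in \mathbb{N}_l} \sum_{i \in \mathbb{N}_n} c_j^2(x_i).
\]

Therefore, the inequality further translates to

\[ R_n(H_{p,D}) \le \frac{\sqrt{DB}}{n} \inf_{2 \le t \le \frac{2p}{p-1}} \Big( t M^{\frac{2}{t}} \sum_{j \in \mathbb{N}_l} \sum_{i \in \mathbb{N}_n} c_j^2(x_i) \Big)^{\frac{1}{2}}. \]

□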


Proof of Theorem^ The proof now simply follows by plugging in the bound of Theoremj^into Theorem 7 
of Bartlett and Mendelson E). □ 


B Support Vector Regression Formulation of CLMKL 

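For the $\epsilon$-insensitive loss $\ell(t, y) = [|y - t| - \epsilon]_+$, denoting $a_+ = \max(a, 0)$ for all $a \in \mathbb{R}$, we have
$\ell^*(-\frac{\alpha_i}{C}, y_i) = -\frac{1}{C} \alpha_i y_i + \epsilon \big| \frac{\alpha_i}{C} \big|$ if $|\alpha_i| \le C$ and $\infty$ elsewise [20]. Hence, the complete dual problem
reduces to

\[
\begin{aligned}
\sup_{\alpha} \ & -\frac{1}{2} \sum_{j \in \mathbb{N}_l} \Big\| \Big( \sum_{i \in \mathbb{N}_n} \alpha_i c_j(x_i) \phi_m(x_i) \Big)_{m=1}^M \Big\|_{2, \frac{2p}{p-1}}^2 + \sum_{i \in \mathbb{N}_n} (\alpha_i y_i - \epsilon |\alpha_i|) \\
\text{s.t.} \ & \sum_{i \in \mathbb{N}_n} \alpha_i = 0, \\
& |\alpha_i| \le C, \quad \forall i \in \mathbb{N}_n.
\end{aligned}
\quad (B.1)
\]

Let $\alpha_i^+, \alpha_i^- \ge 0$ be the positive and negative parts of $\alpha_i$, that is, $\alpha_i = \alpha_i^+ - \alpha_i^-$, $|\alpha_i| = \alpha_i^+ + \alpha_i^-$. Then,
the optimization problem (B.1) translates as follows.

Problem B.1 (CLMKL Regression Problem). For the $\epsilon$-insensitive loss, the dual CLMKL problem is
given by:

\[
\begin{aligned}
\sup_{\alpha^+, \alpha^-} \ & -\frac{1}{2} \sum_{j \in \mathbb{N}_l} \Big\| \Big( \sum_{i \in \mathbb{N}_n} c_j(x_i) (\alpha_i^+ - \alpha_i^-) \phi_m(x_i) \Big)_{m=1}^M \Big\|_{2, \frac{2p}{p-1}}^2 + \sum_{i \in \mathbb{N}_n} (\alpha_i^+ - \alpha_i^-) y_i - \epsilon \sum_{i \in \mathbb{N}_n} (\alpha_i^+ + \alpha_i^-) \\
\text{s.t.} \ & \sum_{i \in \mathbb{N}_n} (\alpha_i^+ - \alpha_i^-) = 0, \\
& 0 \le \alpha_i^+, \alpha_i^- \le C, \quad \alpha_i^+ \alpha_i^- = 0, \quad \forall i \in \mathbb{N}_n.
\end{aligned}
\quad (B.2)
\]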

C Primal and Dual CLMKL Problem Given Fixed Kernel Weights 

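Temporarily fixing the kernel weights $\beta$, the partial primal and dual optimization problems become as
follows.

Problem C.1 (Primal CLMKL Optimization Problem). Given a loss function $\ell(t, y) : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}$
convex in the first argument and kernel weights $\beta = (\beta_{jm})$, solve

\[
\begin{aligned}
\inf_{w, t, b} \ & \sum_{j \in \mathbb{N}_l} \sum_{m \in \mathbb{N}_M} \frac{\|w_j^{(m)}\|_2^2}{2 \beta_{jm}} + C \sum_{i \in \mathbb{N}_n} \ell(t_i, y_i) \\
\text{s.t.} \ & \sum_{j \in \mathbb{N}_l} \Big[ c_j(x_i) \sum_{m \in \mathbb{N}_M} \langle w_j^{(m)}, \phi_m(x_i) \rangle \Big] + b = t_i, \quad \forall i \in \mathbb{N}_n.
\end{aligned}
\quad (C.1)
\]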
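Problem C.2 (Dual CLMKL Problem, Partially Dualized). Given a loss function $\ell(t, y) : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}$
convex in the first argument and kernel weights $\beta = (\beta_{jm})$, solve

\[ \sup_{\alpha : \sum_{i \in \mathbb{N}_n} \alpha_i = 0} \ -C \sum_{i \in \mathbb{N}_n} \ell^*\Big( -\frac{\alpha_i}{C}, y_i \Big) - \frac{1}{2} \sum_{j \in \mathbb{N}_l} \sum_{m \in \mathbb{N}_M} \beta_{jm} \Big\| \sum_{i \in \mathbb{N}_n} \alpha_i c_j(x_i) \phi_m(x_i) \Big\|_2^2. \quad (C.2) \]

For any feasible dual variables, the primal variable $w_j^{(m)}(\alpha)$ minimizing the associated Lagrangian saddle
problem is

\[ w_j^{(m)}(\alpha) = \beta_{jm} \sum_{i \in \mathbb{N}_n} \alpha_i c_j(x_i) \phi_m(x_i). \quad (C.3) \]

Dualization. With the Lagrangian multipliers $\alpha_i$, $i \in \mathbb{N}_n$, the Lagrangian saddle problem of Eq. (C.1) is

\[
\begin{aligned}
& \sup_{\alpha} \inf_{w, t, b} \ \sum_{j \in \mathbb{N}_l} \sum_{m \in \mathbb{N}_M} \frac{\|w_j^{(m)}\|_2^2}{2 \beta_{jm}} + C \sum_{i \in \mathbb{N}_n} \ell(t_i, y_i) - \sum_{i \in \mathbb{N}_n} \alpha_i \Big( \sum_{j \in \mathbb{N}_l} c_j(x_i) \sum_{m \in \mathbb{N}_M} \langle w_j^{(m)}, \phi_m(x_i) \rangle + b - t_i \Big) \\
&= \sup_{\alpha} \Big\{ -C \sum_{i \in \mathbb{N}_n} \sup_{t_i} \Big[ -\ell(t_i, y_i) - \frac{1}{C} \alpha_i t_i \Big] - \sup_{b} \sum_{i \in \mathbb{N}_n} \alpha_i b \\
&\qquad\quad - \sum_{j \in \mathbb{N}_l} \sum_{m \in \mathbb{N}_M} \sup_{w_j^{(m)}} \frac{1}{\beta_{jm}} \Big[ \Big\langle w_j^{(m)}, \sum_{i \in \mathbb{N}_n} \beta_{jm} \alpha_i c_j(x_i) \phi_m(x_i) \Big\rangle - \frac{1}{2} \|w_j^{(m)}\|_2^2 \Big] \Big\} \\
&= \sup_{\alpha : \sum_{i \in \mathbb{N}_n} \alpha_i = 0} \ -C \sum_{i \in \mathbb{N}_n} \ell^*\Big( -\frac{\alpha_i}{C}, y_i \Big) - \frac{1}{2} \sum_{j \in \mathbb{N}_l} \sum_{m \in \mathbb{N}_M} \beta_{jm} \Big\| \sum_{i \in \mathbb{N}_n} \alpha_i c_j(x_i) \phi_m(x_i) \Big\|_2^2.
\end{aligned}
\]

From the above deduction, the variable $w_j^{(m)}(\alpha)$ is a solution of the following problem

\[ w_j^{(m)}(\alpha) = \arg\max_{v \in H_m} \Big[ \Big\langle v, \sum_{i \in \mathbb{N}_n} \alpha_i \beta_{jm} c_j(x_i) \phi_m(x_i) \Big\rangle - \frac{1}{2} \|v\|_2^2 \Big] \]

and it can be directly checked that this $w_j^{(m)}(\alpha)$ can be analytically represented by

\[ w_j^{(m)}(\alpha) = \beta_{jm} \sum_{i \in \mathbb{N}_n} \alpha_i c_j(x_i) \phi_m(x_i). \quad (C.4) \]

□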


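Plugging the Fenchel conjugate function of the hinge loss and the $\epsilon$-insensitive loss into Problem C.2,
we have the following partial dual problems for the hinge loss and $\epsilon$-insensitive loss. Here $\tilde{k}$ is the
locality-weighted composite kernel defined in the main text.

Problem C.3 (Dual CLMKL Problem, Partially Dualized for Hinge Loss).

\[
\begin{aligned}
\sup_{\alpha} \ & \sum_{i \in \mathbb{N}_n} \alpha_i - \frac{1}{2} \sum_{i, \tilde{i} \in \mathbb{N}_n} y_i y_{\tilde{i}} \alpha_i \alpha_{\tilde{i}} \tilde{k}(x_i, x_{\tilde{i}}) \\
\text{s.t.} \ & \sum_{i \in \mathbb{N}_n} \alpha_i y_i = 0, \\
& 0 \le \alpha_i \le C, \quad \forall i \in \mathbb{N}_n.
\end{aligned}
\]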
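
Expanding the squared norm in the partially dualized objective gives $\sum_{j} \sum_{m} \beta_{jm} \| \sum_i \alpha_i c_j(x_i) \phi_m(x_i) \|_2^2 = \sum_{i, \tilde{i}} \alpha_i \alpha_{\tilde{i}} \sum_j \sum_m \beta_{jm} c_j(x_i) k_m(x_i, x_{\tilde{i}}) c_j(x_{\tilde{i}})$, so the hinge-loss dual only needs one combined Gram matrix. The sketch below builds that matrix with NumPy; the function names, array layouts and random inputs are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def composite_kernel(kernels, memberships, beta):
    """kernels:     (M, n, n) base Gram matrices k_m(x_i, x_i')
    memberships: (l, n) cluster likelihoods c_j(x_i)
    beta:        (l, M) kernel weights beta_jm
    Returns the (n, n) matrix sum_j sum_m beta_jm c_j(x_i) k_m(x_i, x_i') c_j(x_i')."""
    return np.einsum("jm,ja,mab,jb->ab", beta, memberships, kernels, memberships)


def hinge_dual_objective(alpha, y, K):
    """Objective of the hinge-loss dual: sum_i alpha_i - 0.5 * (alpha*y)^T K (alpha*y)."""
    ay = alpha * y
    return alpha.sum() - 0.5 * ay @ K @ ay


# Hypothetical toy data: n = 6 points, M = 2 kernels, l = 3 clusters.
rng = np.random.default_rng(0)
n, M, l = 6, 2, 3
feats = [rng.normal(size=(n, 4)) for _ in range(M)]
kernels = np.stack([F @ F.T for F in feats])          # linear kernels, PSD by construction
memberships = rng.uniform(size=(l, n))
memberships /= memberships.sum(axis=0, keepdims=True) # likelihoods sum to one per point
beta = rng.uniform(size=(l, M))
y = np.where(rng.normal(size=n) > 0, 1.0, -1.0)
K = composite_kernel(kernels, memberships, beta)
print(K.shape)
```

Since `K` is an ordinary positive semi-definite Gram matrix, it can be handed to any standard SVM solver that accepts precomputed kernels.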
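Problem C.4 (Dual CLMKL Problem, Partially Dualized for $\epsilon$-Insensitive Loss).

\[
\begin{aligned}
\sup_{\alpha^+, \alpha^-} \ & -\frac{1}{2} \sum_{i, \tilde{i} \in \mathbb{N}_n} (\alpha_i^+ - \alpha_i^-)(\alpha_{\tilde{i}}^+ - \alpha_{\tilde{i}}^-) \tilde{k}(x_i, x_{\tilde{i}}) + \sum_{i \in \mathbb{N}_n} (\alpha_i^+ - \alpha_i^-) y_i - \epsilon \sum_{i \in \mathbb{N}_n} (\alpha_i^+ + \alpha_i^-) \\
\text{s.t.} \ & \sum_{i \in \mathbb{N}_n} (\alpha_i^+ - \alpha_i^-) = 0, \\
& 0 \le \alpha_i^+, \alpha_i^- \le C, \quad \alpha_i^+ \alpha_i^- = 0, \quad \forall i \in \mathbb{N}_n.
\end{aligned}
\]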

D Details on Our Implementation of Localized MKL 


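Gönen and Alpaydin [14] give the first formulation of a localized MKL algorithm by using the gating model
$\eta_m(x) \propto \exp(\langle v_m, x \rangle + v_{m0})$ to realize locality, and optimize the parameters $v_m, v_{m0}$, $m \in \mathbb{N}_M$ with
a gradient descent method. However, the calculation of the gradients requires $O(n^2 M^2 d)$ operations in
Gönen and Alpaydin [14], which scales poorly w.r.t. the dimension $d$ and the number of kernels. Also,
the definition of the gating model requires the information of primitive features, which is not accessible in
some application areas. For example, data in bioinformatics may appear in a non-vectorial format such
as trees and graphs, for which the representation of the data with vectors is non-trivial but the calculation
of kernel matrices is direct. Although Gönen and Alpaydin [14] propose to use the empirical feature map
$[k_0(x_1, x), \ldots, k_0(x_n, x)]$ to replace the primitive feature in this case ($k_0$ is a kernel), this turns out
to not strictly obey the spirit of the gating function: the empirical feature does not reflect the location of
the example in the feature space induced from the kernel. Furthermore, with this strategy the computation
of the gradient scales as $O(n^3 M^2)$, which is quite computationally expensive. In this paper, we give a natural
definition of the gating model in a kernel-induced feature space, and provide a fast implementation of the
resulting LMKL algorithm. Let $k_0$ be the kernel used to define the gating model, and let $\phi_0$ be the associated
feature map. Our basic idea is based on the observation that the parameter $v_m$ can always be represented as a
linear combination of $\phi_0(x_1), \ldots, \phi_0(x_n)$, so the calculation of the representation coefficients is sufficient
to restore $v_m$. We consider the gating model of the form

\[ \eta_m(x) = \frac{\exp(\langle v_m, \phi_0(x) \rangle + v_{m0})}{\sum_{\tilde{m} \in \mathbb{N}_M} \exp(\langle v_{\tilde{m}}, \phi_0(x) \rangle + v_{\tilde{m}0})}. \]

Gönen and Alpaydin [14] proposed to optimize the objective function

\[ J(v) := \sum_{i \in \mathbb{N}_n} \alpha_i - \frac{1}{2} \sum_{i \in \mathbb{N}_n} \sum_{\tilde{i} \in \mathbb{N}_n} \alpha_i \alpha_{\tilde{i}} y_i y_{\tilde{i}} \sum_{m \in \mathbb{N}_M} \eta_m(x_i) k_m(x_i, x_{\tilde{i}}) \eta_m(x_{\tilde{i}}) \]

with a gradient descent method. The gradient of $J(v)$ can be expressed as:

\[ \frac{\partial J(v)}{\partial v_m} = -\sum_{i \in \mathbb{N}_n} \sum_{\tilde{i} \in \mathbb{N}_n} \sum_{\tilde{m} \in \mathbb{N}_M} \alpha_i \alpha_{\tilde{i}} y_i y_{\tilde{i}} \eta_{\tilde{m}}(x_i) k_{\tilde{m}}(x_i, x_{\tilde{i}}) \eta_{\tilde{m}}(x_{\tilde{i}}) \phi_0(x_i) [\delta_m^{\tilde{m}} - \eta_m(x_i)], \quad (D.1) \]

where $\delta_m^{\tilde{m}} = 1$ if $m = \tilde{m}$ and 0 otherwise. Let $v^{(t)} = (v_1^{(t)}, \ldots, v_M^{(t)})$ be the value of $v = (v_1, \ldots, v_M)$ at the
$t$-th iteration and let $r^{(t)}(i, m)$ be the representation coefficient of $v_m^{(t)}$ in terms of $\phi_0(x_i)$, i.e., $v_m^{(t)} = \sum_{i \in \mathbb{N}_n} r^{(t)}(i, m) \phi_0(x_i)$.
Analogously, let $g^{(t)}(i, m)$ be the representation coefficient of $\partial J(v^{(t)}) / \partial v_m$ in terms of $\phi_0(x_i)$. Introduce
two arrays for convenience:

\[ B(i, m) = \sum_{\tilde{i} \in \mathbb{N}_n} \alpha_{\tilde{i}} y_{\tilde{i}} k_m(x_{\tilde{i}}, x_i) \eta_m(x_{\tilde{i}}), \quad A(i) = \sum_{m \in \mathbb{N}_M} \eta_m(x_i) B(i, m), \quad i \in \mathbb{N}_n, m \in \mathbb{N}_M. \]

Eq. (D.1) then implies that

\[
\begin{aligned}
g^{(t)}(i, m) &= -\sum_{\tilde{i} \in \mathbb{N}_n} \sum_{\tilde{m} \in \mathbb{N}_M} \alpha_i \alpha_{\tilde{i}} y_i y_{\tilde{i}} \eta_{\tilde{m}}(x_i) k_{\tilde{m}}(x_i, x_{\tilde{i}}) \eta_{\tilde{m}}(x_{\tilde{i}}) [\delta_m^{\tilde{m}} - \eta_m(x_i)] \\
&= -\alpha_i y_i \eta_m(x_i) \sum_{\tilde{i} \in \mathbb{N}_n} \alpha_{\tilde{i}} y_{\tilde{i}} k_m(x_i, x_{\tilde{i}}) \eta_m(x_{\tilde{i}}) + \alpha_i y_i \eta_m(x_i) \sum_{\tilde{i} \in \mathbb{N}_n} \sum_{\tilde{m} \in \mathbb{N}_M} \alpha_{\tilde{i}} y_{\tilde{i}} \eta_{\tilde{m}}(x_i) k_{\tilde{m}}(x_i, x_{\tilde{i}}) \eta_{\tilde{m}}(x_{\tilde{i}}) \\
&= -\alpha_i y_i \eta_m(x_i) [B(i, m) - A(i)].
\end{aligned}
\quad (D.2)
\]

With the gradient descent step $v_m^{(t+1)} = v_m^{(t)} - \mu^{(t)} \frac{\partial J(v^{(t)})}{\partial v_m}$, $m \in \mathbb{N}_M$, where $\mu^{(t)}$ is the step size found by
line search, the representation coefficient can be simply updated by taking

\[ r^{(t+1)}(i, m) = r^{(t)}(i, m) - \mu^{(t)} g^{(t)}(i, m), \quad \forall i \in \mathbb{N}_n, m \in \mathbb{N}_M. \]

Also, in the calculation of the gating model, we need to calculate $\langle \phi_0(x_i), v_m \rangle$, and this can be fulfilled by

\[ \langle \phi_0(x_i), v_m^{(t)} \rangle = \sum_{\tilde{i} \in \mathbb{N}_n} \langle \phi_0(x_i), r^{(t)}(\tilde{i}, m) \phi_0(x_{\tilde{i}}) \rangle = \sum_{\tilde{i} \in \mathbb{N}_n} k_0(x_i, x_{\tilde{i}}) r^{(t)}(\tilde{i}, m). \]

At each iteration, we can use $O(n^2 M)$ operations to calculate the arrays $A$, $B$. Subsequently, the calculation
of the gradients as illustrated by Eq. (D.2) can be fulfilled with $O(nM)$ operations. The updating of the
representation coefficients $r^{(t)}(i, m)$ requires $O(nM)$ operations, while calculating the gating model $\eta_m(x_i)$
requires further $O(n^2 M)$ operations. Putting the above discussions together, our implementation of LMKL
based on the kernel trick requires $O(n^2 M)$ operations at each iteration, which is much faster than the original
implementation in Gönen and Alpaydin [14] with $O(n^2 M^2 d)$ operations at each iteration. Here, $d$ is the
dimension of the primitive feature.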
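
The per-iteration cost described above can be made concrete with a vectorized computation of the arrays $B(i, m) = \sum_{\tilde{i}} \alpha_{\tilde{i}} y_{\tilde{i}} k_m(x_{\tilde{i}}, x_i) \eta_m(x_{\tilde{i}})$ and $A(i) = \sum_m \eta_m(x_i) B(i, m)$, from which the gradient coefficients follow as $-\alpha_i y_i \eta_m(x_i) [B(i, m) - A(i)]$. The code is a minimal sketch under assumed array layouts; the function names, the bias vector `v0` and the random inputs are hypothetical.

```python
import numpy as np


def gating(K0, r, v0):
    """eta[i, m] = softmax over m of sum_i' K0[i, i'] * r[i', m] + v0[m].
    K0: (n, n) gating kernel matrix, r: (n, M) representation coefficients."""
    scores = K0 @ r + v0                           # O(n^2 M)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)


def gradient_coefficients(kernels, alpha, y, eta):
    """kernels: (M, n, n) symmetric Gram matrices; alpha, y: (n,); eta: (n, M).
    Returns g[i, m] = -alpha_i y_i eta_m(x_i) (B[i, m] - A[i])."""
    ay = alpha * y
    B = np.einsum("a,mai,am->im", ay, kernels, eta)  # O(n^2 M)
    A = np.sum(eta * B, axis=1)                      # O(n M)
    return -(ay[:, None] * eta) * (B - A[:, None])   # O(n M)


# Hypothetical toy inputs: n = 5 points, M = 3 kernels.
rng = np.random.default_rng(0)
n, M = 5, 3
feats = [rng.normal(size=(n, 4)) for _ in range(M)]
kernels = np.stack([F @ F.T for F in feats])
F0 = rng.normal(size=(n, 4))
K0 = F0 @ F0.T
alpha = rng.uniform(size=n)
y = np.where(rng.normal(size=n) > 0, 1.0, -1.0)
r = rng.normal(scale=0.1, size=(n, M))
eta = gating(K0, r, np.zeros(M))
g = gradient_coefficients(kernels, alpha, y, eta)
print(g.shape)
```

A gradient step then amounts to subtracting a multiple of `g` from `r`, so the parameters $v_m$ never have to be formed explicitly.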


E Background on the Experimental Setup and Empirical Results 

E.l Details on the Protein Fold Prediction Experiment 

We precisely replicate the experimental setup of previous experiments by Campbell and Ying [7], Kloft
[24], and Kloft and Blanchard [25], so we use the train/test split supplied by Campbell and Ying [7] and perform
kernel to generate a partition with 3 clusters for CLMKL and HLMKL, and, since we have no access to 
primitive features, use this kernel to define gating model in LMKL. All the base kernel matrices are mul- 
tiplicatively normalized before training. We validate the regularization parameter C over 
and the average evenness over the interval [0.4, 0.7] with eight linearly spaced points. Note that the
model parameters are tuned separately for each training set and only based on the training set, not the test 
set. We repeat the experiment 15 times. 

E.2 Details on the Visual Image Categorization Experiment 

We compute 9 bag-of-words features, each with a dictionary size of 512, resulting in 9 x^-Kernels E). 
The first 6 bag-of-words features are computed over SIFT features 1^ at three different scales and the two
color channel sets RGB and opponent colors E). The remaining 3 bag-of-words features are computed 
over quantiles of color values at the same three scales. The quantiles are concatenated over RGB channels. 
For each channel within a set of color channels, the quantiles are concatenated. Local features are extracted
at a grid of step size 5 on images that were down-scaled to 600 pixels in the largest dimension. Assignment 
of local features to visual words is done using rank-mapping 0. The kernel width of the kernels is set to 
the mean of the $\chi^2$-distances. All kernels are multiplicatively normalized.

The dataset is split into 11 parts for outer cross validation. The performance reported in Table 4 is the
average over the 11 test splits of the outer cross validation. For each outer cross validation training split, a
10-fold inner cross validation is performed for determining optimal parameters. The parameters are selected
using only the samples of the outer training split. This avoids reporting a result merely on the most favorable
train/test split from the outer cross validation. For the proposed CLMKL we employ kernel k-means with 3
clusters on the outer training split of the dataset.
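
The nested protocol above (11 outer splits, each with a 10-fold inner cross validation restricted to the outer training part) can be sketched as an index generator. This is an illustration only: the function name `nested_cv_splits`, the random unstratified partition and the seed are assumptions, since the text does not specify how the 11 parts were drawn.

```python
import numpy as np


def nested_cv_splits(n_samples, n_outer=11, n_inner=10, seed=0):
    """Yield (outer_train, outer_test, inner_folds) index arrays.
    inner_folds is a list of (fit_idx, val_idx) pairs drawn only from outer_train,
    so parameter selection never sees the outer test split."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    outer_folds = np.array_split(perm, n_outer)
    for o, test_idx in enumerate(outer_folds):
        train_idx = np.concatenate([f for k, f in enumerate(outer_folds) if k != o])
        inner_parts = np.array_split(train_idx, n_inner)
        inner_folds = []
        for v, val_idx in enumerate(inner_parts):
            fit_idx = np.concatenate([f for k, f in enumerate(inner_parts) if k != v])
            inner_folds.append((fit_idx, val_idx))
        yield train_idx, test_idx, inner_folds


splits = list(nested_cv_splits(1574))   # 1574 images in the UIUC Sports dataset
print(len(splits), len(splits[0][2]))   # number of outer splits, inner folds per split
```

Clustering with kernel k-means would be run inside each outer training split in the same way, using only `train_idx`.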

We compare CLMKL to regular £p-norm MKL OTl and to localized MKL as in lfT4l . For all methods, 
we employ a one-versus-all setup, running over £p-norms in {1.125,1.333, 2} and regularization constants 
in {10^/^I^^Q (optima attained inside the respective grids). CLMKL uses the same set of £p-norms, regular¬ 
ization constants from {10^/^}fc=o,...,5^ and average excesses in (0.5 + t/12}^Jg. Performance is measured 
through multi-class classification accuracy. 